The course project is based on the Home Credit Default Risk (HCDR) Kaggle competition. The goal of this project is to predict whether or not a client will repay a loan. To make sure that people who struggle to get loans due to insufficient or non-existent credit histories have a positive loan experience, Home Credit makes use of a variety of alternative data, including telco and transactional information, to predict its clients' repayment abilities.
Kaggle is a data science competition platform that hosts many datasets. In the past, submitting your results was troublesome: you had to go through the console in your browser and drag your files there. Now you can interact with Kaggle via the command line, e.g.,
! kaggle competitions files home-credit-default-risk
It is quite easy to set up; a submission takes less than 15 minutes. Download your `kaggle.json` API token and place it in the right location (`~/.kaggle/kaggle.json`). For more detailed information on setting up the Kaggle API see here and here.
!pip install kaggle
Requirement already satisfied: kaggle in /usr/local/lib/python3.7/dist-packages (1.5.12) Requirement already satisfied: six>=1.10 in /usr/local/lib/python3.7/dist-packages (from kaggle) (1.15.0) Requirement already satisfied: requests in /usr/local/lib/python3.7/dist-packages (from kaggle) (2.23.0) Requirement already satisfied: certifi in /usr/local/lib/python3.7/dist-packages (from kaggle) (2021.10.8) Requirement already satisfied: tqdm in /usr/local/lib/python3.7/dist-packages (from kaggle) (4.64.0) Requirement already satisfied: python-slugify in /usr/local/lib/python3.7/dist-packages (from kaggle) (6.1.2) Requirement already satisfied: python-dateutil in /usr/local/lib/python3.7/dist-packages (from kaggle) (2.8.2) Requirement already satisfied: urllib3 in /usr/local/lib/python3.7/dist-packages (from kaggle) (1.24.3) Requirement already satisfied: text-unidecode>=1.3 in /usr/local/lib/python3.7/dist-packages (from python-slugify->kaggle) (1.3) Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests->kaggle) (2.10) Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests->kaggle) (3.0.4)
!pwd
/content
!mkdir -p ~/.kaggle
!cp /Users/shruthigutta/Downloads/kaggle.json ~/.kaggle  # adjust this path to wherever your kaggle.json lives
!chmod 600 ~/.kaggle/kaggle.json
mkdir: cannot create directory ‘/root/.kaggle’: File exists cp: cannot stat '/Users/shruthigutta/Downloads/kaggle.json': No such file or directory chmod: cannot access '/root/.kaggle/kaggle.json': No such file or directory
! kaggle competitions files home-credit-default-risk
Traceback (most recent call last):
File "/usr/local/bin/kaggle", line 5, in <module>
from kaggle.cli import main
File "/usr/local/lib/python3.7/dist-packages/kaggle/__init__.py", line 23, in <module>
api.authenticate()
File "/usr/local/lib/python3.7/dist-packages/kaggle/api/kaggle_api_extended.py", line 166, in authenticate
self.config_file, self.config_dir))
OSError: Could not find kaggle.json. Make sure it's located in /root/.kaggle. Or use the environment method.
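As the error message notes, an alternative to placing `kaggle.json` on disk is the environment method: the Kaggle client also reads credentials from environment variables. A minimal sketch (the username and key values below are placeholders, not real credentials):

```python
import os

# Placeholder credentials -- substitute your own from your Kaggle account settings
os.environ["KAGGLE_USERNAME"] = "your_username"
os.environ["KAGGLE_KEY"] = "your_api_key"

# The kaggle package authenticates from these variables on import,
# so set them before `import kaggle` or before shelling out to the CLI.
```

Setting the variables in the notebook process is enough for both the Python API and `!kaggle ...` shell calls launched from the same kernel.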
Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.
Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data, including telco and transactional information, to predict their clients' repayment abilities.
While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.
Home Credit is a non-banking financial institution founded in 1997 in the Czech Republic.
The company operates in 14 countries (including the United States, Russia, Kazakhstan, Belarus, China, and India) and focuses on lending primarily to people with little or no credit history, who would otherwise either not obtain loans or become victims of untrustworthy lenders.
The Home Credit group has over 29 million customers, total assets of 21 billion euros, and over 160 million loans, with the majority in Asia and almost half of them in China (as of 2018-05-19).
There are 7 different sources of data:

- application_{train,test}: the main training and test tables, one row per loan application at Home Credit
- bureau: the client's previous credits from other financial institutions, reported to the credit bureau
- bureau_balance: monthly balances of the credits in bureau
- previous_application: previous applications for Home Credit loans by clients in the application data
- POS_CASH_balance: monthly balances of previous point-of-sale and cash loans at Home Credit
- credit_card_balance: monthly balances of previous Home Credit credit cards
- installments_payments: payment history for previous loans at Home Credit
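All of the auxiliary tables link back to the application data through shared keys (`SK_ID_CURR` for the client, plus `SK_ID_PREV` or `SK_ID_BUREAU` for the history tables). A minimal sketch of the aggregate-then-join pattern, on tiny made-up frames standing in for `application_train` and `bureau`:

```python
import pandas as pd

# Toy stand-ins (values made up), keyed on the shared client id SK_ID_CURR
app = pd.DataFrame({"SK_ID_CURR": [100002, 100003], "TARGET": [1, 0]})
bureau = pd.DataFrame({
    "SK_ID_CURR": [100002, 100002, 100003],
    "AMT_CREDIT_SUM": [91323.0, 225000.0, 464323.5],
})

# Collapse the one-to-many bureau rows to one row per client, then left-join
agg = bureau.groupby("SK_ID_CURR")["AMT_CREDIT_SUM"].agg(["count", "mean"]).reset_index()
agg.columns = ["SK_ID_CURR", "BUREAU_LOAN_COUNT", "BUREAU_CREDIT_MEAN"]
merged = app.merge(agg, on="SK_ID_CURR", how="left")
print(merged)
```

The left join keeps every application row even when a client has no bureau history; the real tables follow the same shape, just with millions of rows.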
Create a base directory:
DATA_DIR = "../../../Data/home-credit-default-risk" #same level as course repo in the data directory
Please download the project data files and data dictionary and unzip them using either of the following approaches:
Use the Download button on the competition's Data webpage and unzip the zip file to the DATA_DIR, e.g.:

DATA_DIR = "/content/drive/MyDrive/home-credit-default-risk"
#DATA_DIR="/content/drive" #same level as course repo in the data directory
#DATA_DIR = os.path.join('./ddddd/')
#mkdir $DATA_DIR
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
#!ls -l $DATA_DIR
#! kaggle competitions download home-credit-default-risk -p $DATA_DIR
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
import os
import zipfile
from sklearn.base import BaseEstimator, TransformerMixin
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline, FeatureUnion
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
import warnings
warnings.filterwarnings('ignore')
unzippingReq = False  # set to True if the CSVs still need to be extracted
if unzippingReq:  # please modify this code
    archives = [
        'application_train.csv.zip',
        'application_test.csv.zip',
        'bureau_balance.csv.zip',
        'bureau.csv.zip',
        'credit_card_balance.csv.zip',
        'installments_payments.csv.zip',
        'POS_CASH_balance.csv.zip',
        'previous_application.csv.zip',
    ]
    for archive in archives:
        with zipfile.ZipFile(archive, 'r') as zip_ref:
            zip_ref.extractall('datasets')
def load_data(in_path, name):
    df = pd.read_csv(in_path)
    print(f"{name}: shape is {df.shape}")
    #print(df.info())
    display(df.head(5))
    return df
datasets = {}  # let's store the datasets in a dictionary so we can keep track of them easily
# from google.colab import drive
# drive.mount('/content/drive')
ds_name = 'application_test'
datasets[ds_name] = load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)
application_test: shape is (48744, 121)
| SK_ID_CURR | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | Cash loans | F | N | Y | 0 | 135000.0 | 568800.0 | 20560.5 | 450000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 100005 | Cash loans | M | N | Y | 0 | 99000.0 | 222768.0 | 17370.0 | 180000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 2 | 100013 | Cash loans | M | Y | Y | 0 | 202500.0 | 663264.0 | 69777.0 | 630000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 |
| 3 | 100028 | Cash loans | F | N | Y | 2 | 315000.0 | 1575000.0 | 49018.5 | 1575000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 4 | 100038 | Cash loans | M | Y | N | 1 | 180000.0 | 625500.0 | 32067.0 | 625500.0 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 121 columns
The application dataset has the most information about the client: gender, income, family status, education, etc.
%%time
ds_names = ("application_train", "application_test", "bureau", "bureau_balance", "credit_card_balance",
            "installments_payments", "previous_application", "POS_CASH_balance")
for ds_name in ds_names:
    datasets[ds_name] = load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)
application_train: shape is (307511, 122)
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 122 columns
application_test: shape is (48744, 121)
| SK_ID_CURR | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | Cash loans | F | N | Y | 0 | 135000.0 | 568800.0 | 20560.5 | 450000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 100005 | Cash loans | M | N | Y | 0 | 99000.0 | 222768.0 | 17370.0 | 180000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 2 | 100013 | Cash loans | M | Y | Y | 0 | 202500.0 | 663264.0 | 69777.0 | 630000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 |
| 3 | 100028 | Cash loans | F | N | Y | 2 | 315000.0 | 1575000.0 | 49018.5 | 1575000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 4 | 100038 | Cash loans | M | Y | N | 1 | 180000.0 | 625500.0 | 32067.0 | 625500.0 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 121 columns
bureau: shape is (1716428, 17)
| SK_ID_CURR | SK_ID_BUREAU | CREDIT_ACTIVE | CREDIT_CURRENCY | DAYS_CREDIT | CREDIT_DAY_OVERDUE | DAYS_CREDIT_ENDDATE | DAYS_ENDDATE_FACT | AMT_CREDIT_MAX_OVERDUE | CNT_CREDIT_PROLONG | AMT_CREDIT_SUM | AMT_CREDIT_SUM_DEBT | AMT_CREDIT_SUM_LIMIT | AMT_CREDIT_SUM_OVERDUE | CREDIT_TYPE | DAYS_CREDIT_UPDATE | AMT_ANNUITY | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 215354 | 5714462 | Closed | currency 1 | -497 | 0 | -153.0 | -153.0 | NaN | 0 | 91323.0 | 0.0 | NaN | 0.0 | Consumer credit | -131 | NaN |
| 1 | 215354 | 5714463 | Active | currency 1 | -208 | 0 | 1075.0 | NaN | NaN | 0 | 225000.0 | 171342.0 | NaN | 0.0 | Credit card | -20 | NaN |
| 2 | 215354 | 5714464 | Active | currency 1 | -203 | 0 | 528.0 | NaN | NaN | 0 | 464323.5 | NaN | NaN | 0.0 | Consumer credit | -16 | NaN |
| 3 | 215354 | 5714465 | Active | currency 1 | -203 | 0 | NaN | NaN | NaN | 0 | 90000.0 | NaN | NaN | 0.0 | Credit card | -16 | NaN |
| 4 | 215354 | 5714466 | Active | currency 1 | -629 | 0 | 1197.0 | NaN | 77674.5 | 0 | 2700000.0 | NaN | NaN | 0.0 | Consumer credit | -21 | NaN |
bureau_balance: shape is (27299925, 3)
| SK_ID_BUREAU | MONTHS_BALANCE | STATUS | |
|---|---|---|---|
| 0 | 5715448 | 0 | C |
| 1 | 5715448 | -1 | C |
| 2 | 5715448 | -2 | C |
| 3 | 5715448 | -3 | C |
| 4 | 5715448 | -4 | C |
credit_card_balance: shape is (3840312, 23)
| SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | AMT_BALANCE | AMT_CREDIT_LIMIT_ACTUAL | AMT_DRAWINGS_ATM_CURRENT | AMT_DRAWINGS_CURRENT | AMT_DRAWINGS_OTHER_CURRENT | AMT_DRAWINGS_POS_CURRENT | AMT_INST_MIN_REGULARITY | ... | AMT_RECIVABLE | AMT_TOTAL_RECEIVABLE | CNT_DRAWINGS_ATM_CURRENT | CNT_DRAWINGS_CURRENT | CNT_DRAWINGS_OTHER_CURRENT | CNT_DRAWINGS_POS_CURRENT | CNT_INSTALMENT_MATURE_CUM | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2562384 | 378907 | -6 | 56.970 | 135000 | 0.0 | 877.5 | 0.0 | 877.5 | 1700.325 | ... | 0.000 | 0.000 | 0.0 | 1 | 0.0 | 1.0 | 35.0 | Active | 0 | 0 |
| 1 | 2582071 | 363914 | -1 | 63975.555 | 45000 | 2250.0 | 2250.0 | 0.0 | 0.0 | 2250.000 | ... | 64875.555 | 64875.555 | 1.0 | 1 | 0.0 | 0.0 | 69.0 | Active | 0 | 0 |
| 2 | 1740877 | 371185 | -7 | 31815.225 | 450000 | 0.0 | 0.0 | 0.0 | 0.0 | 2250.000 | ... | 31460.085 | 31460.085 | 0.0 | 0 | 0.0 | 0.0 | 30.0 | Active | 0 | 0 |
| 3 | 1389973 | 337855 | -4 | 236572.110 | 225000 | 2250.0 | 2250.0 | 0.0 | 0.0 | 11795.760 | ... | 233048.970 | 233048.970 | 1.0 | 1 | 0.0 | 0.0 | 10.0 | Active | 0 | 0 |
| 4 | 1891521 | 126868 | -1 | 453919.455 | 450000 | 0.0 | 11547.0 | 0.0 | 11547.0 | 22924.890 | ... | 453919.455 | 453919.455 | 0.0 | 1 | 0.0 | 1.0 | 101.0 | Active | 0 | 0 |
5 rows × 23 columns
installments_payments: shape is (13605401, 8)
| SK_ID_PREV | SK_ID_CURR | NUM_INSTALMENT_VERSION | NUM_INSTALMENT_NUMBER | DAYS_INSTALMENT | DAYS_ENTRY_PAYMENT | AMT_INSTALMENT | AMT_PAYMENT | |
|---|---|---|---|---|---|---|---|---|
| 0 | 1054186 | 161674 | 1.0 | 6 | -1180.0 | -1187.0 | 6948.360 | 6948.360 |
| 1 | 1330831 | 151639 | 0.0 | 34 | -2156.0 | -2156.0 | 1716.525 | 1716.525 |
| 2 | 2085231 | 193053 | 2.0 | 1 | -63.0 | -63.0 | 25425.000 | 25425.000 |
| 3 | 2452527 | 199697 | 1.0 | 3 | -2418.0 | -2426.0 | 24350.130 | 24350.130 |
| 4 | 2714724 | 167756 | 1.0 | 2 | -1383.0 | -1366.0 | 2165.040 | 2160.585 |
previous_application: shape is (1339313, 37)
| SK_ID_PREV | SK_ID_CURR | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | ... | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2030495 | 271877 | Consumer loans | 1730.430 | 17145.0 | 17145.0 | 0.0 | 17145.0 | SATURDAY | 15 | ... | Connectivity | 12.0 | middle | POS mobile with interest | 365243.0 | -42.0 | 300.0 | -42.0 | -37.0 | 0.0 |
| 1 | 2802425 | 108129 | Cash loans | 25188.615 | 607500.0 | 679671.0 | NaN | 607500.0 | THURSDAY | 11 | ... | XNA | 36.0 | low_action | Cash X-Sell: low | 365243.0 | -134.0 | 916.0 | 365243.0 | 365243.0 | 1.0 |
| 2 | 2523466 | 122040 | Cash loans | 15060.735 | 112500.0 | 136444.5 | NaN | 112500.0 | TUESDAY | 11 | ... | XNA | 12.0 | high | Cash X-Sell: high | 365243.0 | -271.0 | 59.0 | 365243.0 | 365243.0 | 1.0 |
| 3 | 2819243 | 176158 | Cash loans | 47041.335 | 450000.0 | 470790.0 | NaN | 450000.0 | MONDAY | 7 | ... | XNA | 12.0 | middle | Cash X-Sell: middle | 365243.0 | -482.0 | -152.0 | -182.0 | -177.0 | 1.0 |
| 4 | 1784265 | 202054 | Cash loans | 31924.395 | 337500.0 | 404055.0 | NaN | 337500.0 | THURSDAY | 9 | ... | XNA | 24.0 | high | Cash Street: high | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 37 columns
POS_CASH_balance: shape is (8278292, 8)
| SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | CNT_INSTALMENT | CNT_INSTALMENT_FUTURE | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF | |
|---|---|---|---|---|---|---|---|---|
| 0 | 1803195 | 182943 | -31.0 | 48.0 | 45.0 | Active | 0.0 | 0.0 |
| 1 | 1715348 | 367990 | -33.0 | 36.0 | 35.0 | Active | 0.0 | 0.0 |
| 2 | 1784872 | 397406 | -32.0 | 12.0 | 9.0 | Active | 0.0 | 0.0 |
| 3 | 1903291 | 269225 | -35.0 | 48.0 | 42.0 | Active | 0.0 | 0.0 |
| 4 | 2341044 | 334279 | -35.0 | 36.0 | 35.0 | Active | 0.0 | 0.0 |
CPU times: user 43 s, sys: 5.99 s, total: 49 s Wall time: 1min 7s
print('\033[1m' + "Size of each dataset : " + '\033[0m', end='\n' * 2)
for ds_name in datasets.keys():
    print(f'dataset {ds_name:24}: [ {datasets[ds_name].shape[0]:10,}, {datasets[ds_name].shape[1]:4}]')
Size of each dataset :
dataset application_test        : [     48,744,  121]
dataset application_train       : [    307,511,  122]
dataset bureau                  : [  1,716,428,   17]
dataset bureau_balance          : [ 27,299,925,    3]
dataset credit_card_balance     : [  3,840,312,   23]
dataset installments_payments   : [ 13,605,401,    8]
dataset previous_application    : [  1,339,313,   37]
dataset POS_CASH_balance        : [  8,278,292,    8]
(datasets['application_train'].dtypes).unique()
array([dtype('int64'), dtype('O'), dtype('float64')], dtype=object)
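These three dtypes are what drive the preprocessing split later: the object columns get encoded, the numeric ones get imputed and scaled. A small sketch of separating columns by dtype, using a made-up two-row frame:

```python
import pandas as pd

# Tiny made-up frame mixing the three dtypes seen in application_train
df = pd.DataFrame({
    "SK_ID_CURR": [100002, 100003],        # int64
    "AMT_CREDIT": [406597.5, 1293502.5],   # float64
    "CODE_GENDER": ["M", "F"],             # object
})

numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(include="object").columns.tolist()
print(numeric_cols)      # ['SK_ID_CURR', 'AMT_CREDIT']
print(categorical_cols)  # ['CODE_GENDER']
```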
from IPython.display import display, HTML
pd.set_option("display.max_rows", None, "display.max_columns", None)
# Full stats
def stats_summary1(df, df_name):
    print(df.info(verbose=True, show_counts=True))
    print("-----"*15)
    print(f"Shape of the df {df_name} is {df.shape} \n")
    print("-----"*15)
    print(f"Statistical summary of {df_name} is :")
    print("-----"*15)
    print(f"Description of the df {df_name}:\n")
    display(HTML(np.round(df.describe(), 2).to_html()))
def stats_summary2(df, df_name):
    print(f"Description of the df continued for {df_name}:\n")
    print("-----"*15)
    print("Data type value counts: \n", df.dtypes.value_counts())
    print("\nNumber of unique values in each object (categorical) column: \n")
    print(df.select_dtypes('object').apply(pd.Series.nunique, axis=0))
# List the categorical and numerical features of a DF
def feature_datatypes_groups(df, df_name):
    df_dtypes = df.columns.to_series().groupby(df.dtypes).groups
    print("-----"*15)
    print(f"Categorical and numerical (int + float) features of {df_name}.")
    print("-----"*15)
    print()
    for k, v in df_dtypes.items():
        print({k.name: v})
        print("---"*10)
    print("\n \n")
# Null data list and plot.
def null_data_plot(df, df_name):
    percent = (df.isnull().sum()/df.isnull().count()*100).sort_values(ascending=False).round(2)
    sum_missing = df.isna().sum().sort_values(ascending=False)
    missing_data = pd.concat([percent, sum_missing], axis=1, keys=['Percent', 'Train Missing Count'])
    missing_data = missing_data[missing_data['Percent'] > 0]
    print("-----"*15)
    print("-----"*15)
    print('\n The Missing Data: \n')
    if len(missing_data) == 0:
        print("No missing data")
    else:
        display(HTML(missing_data.to_html()))  # display all the rows
    print("-----"*15)
    if len(df.columns) > 35:
        f, ax = plt.subplots(figsize=(8, 15))
    else:
        f, ax = plt.subplots()
    plt.title(f'Percent missing data for {df_name}.', fontsize=10)
    fig = sns.barplot(x=missing_data["Percent"], y=missing_data.index, alpha=0.8)
    plt.xlabel('Percent of missing values', fontsize=10)
    plt.ylabel('Features', fontsize=10)
    return missing_data
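The core computation in null_data_plot is the per-column missing percentage; on a toy frame (column names a and b are made up) it looks like this:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0, np.nan], "b": [1, 2, 3, 4]})

# Fraction of missing cells per column, as a percentage, largest first
percent = (df.isnull().sum() / len(df) * 100).sort_values(ascending=False).round(2)
print(percent)  # a    50.0
                # b     0.0
```

Filtering to `percent > 0` (as the function does) then leaves only the columns that actually need imputation.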
# Full consolidation of all the stats functions.
def display_stats(df, df_name):
    print("--"*40)
    print(" "*20 + '\033[1m' + df_name + '\033[0m' + " "*20)
    print("--"*40)
    stats_summary1(df, df_name)

def display_feature_info(df, df_name):
    stats_summary2(df, df_name)
    feature_datatypes_groups(df, df_name)
    null_data_plot(df, df_name)
display_stats(datasets['application_train'], 'application_train')
--------------------------------------------------------------------------------
application_train
--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Data columns (total 122 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SK_ID_CURR 307511 non-null int64
1 TARGET 307511 non-null int64
2 NAME_CONTRACT_TYPE 307511 non-null object
3 CODE_GENDER 307511 non-null object
4 FLAG_OWN_CAR 307511 non-null object
5 FLAG_OWN_REALTY 307511 non-null object
6 CNT_CHILDREN 307511 non-null int64
7 AMT_INCOME_TOTAL 307511 non-null float64
8 AMT_CREDIT 307511 non-null float64
9 AMT_ANNUITY 307499 non-null float64
10 AMT_GOODS_PRICE 307233 non-null float64
11 NAME_TYPE_SUITE 306219 non-null object
12 NAME_INCOME_TYPE 307511 non-null object
13 NAME_EDUCATION_TYPE 307511 non-null object
14 NAME_FAMILY_STATUS 307511 non-null object
15 NAME_HOUSING_TYPE 307511 non-null object
16 REGION_POPULATION_RELATIVE 307511 non-null float64
17 DAYS_BIRTH 307511 non-null int64
18 DAYS_EMPLOYED 307511 non-null int64
19 DAYS_REGISTRATION 307511 non-null float64
20 DAYS_ID_PUBLISH 307511 non-null int64
21 OWN_CAR_AGE 104582 non-null float64
22 FLAG_MOBIL 307511 non-null int64
23 FLAG_EMP_PHONE 307511 non-null int64
24 FLAG_WORK_PHONE 307511 non-null int64
25 FLAG_CONT_MOBILE 307511 non-null int64
26 FLAG_PHONE 307511 non-null int64
27 FLAG_EMAIL 307511 non-null int64
28 OCCUPATION_TYPE 211120 non-null object
29 CNT_FAM_MEMBERS 307509 non-null float64
30 REGION_RATING_CLIENT 307511 non-null int64
31 REGION_RATING_CLIENT_W_CITY 307511 non-null int64
32 WEEKDAY_APPR_PROCESS_START 307511 non-null object
33 HOUR_APPR_PROCESS_START 307511 non-null int64
34 REG_REGION_NOT_LIVE_REGION 307511 non-null int64
35 REG_REGION_NOT_WORK_REGION 307511 non-null int64
36 LIVE_REGION_NOT_WORK_REGION 307511 non-null int64
37 REG_CITY_NOT_LIVE_CITY 307511 non-null int64
38 REG_CITY_NOT_WORK_CITY 307511 non-null int64
39 LIVE_CITY_NOT_WORK_CITY 307511 non-null int64
40 ORGANIZATION_TYPE 307511 non-null object
41 EXT_SOURCE_1 134133 non-null float64
42 EXT_SOURCE_2 306851 non-null float64
43 EXT_SOURCE_3 246546 non-null float64
44 APARTMENTS_AVG 151450 non-null float64
45 BASEMENTAREA_AVG 127568 non-null float64
46 YEARS_BEGINEXPLUATATION_AVG 157504 non-null float64
47 YEARS_BUILD_AVG 103023 non-null float64
48 COMMONAREA_AVG 92646 non-null float64
49 ELEVATORS_AVG 143620 non-null float64
50 ENTRANCES_AVG 152683 non-null float64
51 FLOORSMAX_AVG 154491 non-null float64
52 FLOORSMIN_AVG 98869 non-null float64
53 LANDAREA_AVG 124921 non-null float64
54 LIVINGAPARTMENTS_AVG 97312 non-null float64
55 LIVINGAREA_AVG 153161 non-null float64
56 NONLIVINGAPARTMENTS_AVG 93997 non-null float64
57 NONLIVINGAREA_AVG 137829 non-null float64
58 APARTMENTS_MODE 151450 non-null float64
59 BASEMENTAREA_MODE 127568 non-null float64
60 YEARS_BEGINEXPLUATATION_MODE 157504 non-null float64
61 YEARS_BUILD_MODE 103023 non-null float64
62 COMMONAREA_MODE 92646 non-null float64
63 ELEVATORS_MODE 143620 non-null float64
64 ENTRANCES_MODE 152683 non-null float64
65 FLOORSMAX_MODE 154491 non-null float64
66 FLOORSMIN_MODE 98869 non-null float64
67 LANDAREA_MODE 124921 non-null float64
68 LIVINGAPARTMENTS_MODE 97312 non-null float64
69 LIVINGAREA_MODE 153161 non-null float64
70 NONLIVINGAPARTMENTS_MODE 93997 non-null float64
71 NONLIVINGAREA_MODE 137829 non-null float64
72 APARTMENTS_MEDI 151450 non-null float64
73 BASEMENTAREA_MEDI 127568 non-null float64
74 YEARS_BEGINEXPLUATATION_MEDI 157504 non-null float64
75 YEARS_BUILD_MEDI 103023 non-null float64
76 COMMONAREA_MEDI 92646 non-null float64
77 ELEVATORS_MEDI 143620 non-null float64
78 ENTRANCES_MEDI 152683 non-null float64
79 FLOORSMAX_MEDI 154491 non-null float64
80 FLOORSMIN_MEDI 98869 non-null float64
81 LANDAREA_MEDI 124921 non-null float64
82 LIVINGAPARTMENTS_MEDI 97312 non-null float64
83 LIVINGAREA_MEDI 153161 non-null float64
84 NONLIVINGAPARTMENTS_MEDI 93997 non-null float64
85 NONLIVINGAREA_MEDI 137829 non-null float64
86 FONDKAPREMONT_MODE 97216 non-null object
87 HOUSETYPE_MODE 153214 non-null object
88 TOTALAREA_MODE 159080 non-null float64
89 WALLSMATERIAL_MODE 151170 non-null object
90 EMERGENCYSTATE_MODE 161756 non-null object
91 OBS_30_CNT_SOCIAL_CIRCLE 306490 non-null float64
92 DEF_30_CNT_SOCIAL_CIRCLE 306490 non-null float64
93 OBS_60_CNT_SOCIAL_CIRCLE 306490 non-null float64
94 DEF_60_CNT_SOCIAL_CIRCLE 306490 non-null float64
95 DAYS_LAST_PHONE_CHANGE 307510 non-null float64
96 FLAG_DOCUMENT_2 307511 non-null int64
97 FLAG_DOCUMENT_3 307511 non-null int64
98 FLAG_DOCUMENT_4 307511 non-null int64
99 FLAG_DOCUMENT_5 307511 non-null int64
100 FLAG_DOCUMENT_6 307511 non-null int64
101 FLAG_DOCUMENT_7 307511 non-null int64
102 FLAG_DOCUMENT_8 307511 non-null int64
103 FLAG_DOCUMENT_9 307511 non-null int64
104 FLAG_DOCUMENT_10 307511 non-null int64
105 FLAG_DOCUMENT_11 307511 non-null int64
106 FLAG_DOCUMENT_12 307511 non-null int64
107 FLAG_DOCUMENT_13 307511 non-null int64
108 FLAG_DOCUMENT_14 307511 non-null int64
109 FLAG_DOCUMENT_15 307511 non-null int64
110 FLAG_DOCUMENT_16 307511 non-null int64
111 FLAG_DOCUMENT_17 307511 non-null int64
112 FLAG_DOCUMENT_18 307511 non-null int64
113 FLAG_DOCUMENT_19 307511 non-null int64
114 FLAG_DOCUMENT_20 307511 non-null int64
115 FLAG_DOCUMENT_21 307511 non-null int64
116 AMT_REQ_CREDIT_BUREAU_HOUR 265992 non-null float64
117 AMT_REQ_CREDIT_BUREAU_DAY 265992 non-null float64
118 AMT_REQ_CREDIT_BUREAU_WEEK 265992 non-null float64
119 AMT_REQ_CREDIT_BUREAU_MON 265992 non-null float64
120 AMT_REQ_CREDIT_BUREAU_QRT 265992 non-null float64
121 AMT_REQ_CREDIT_BUREAU_YEAR 265992 non-null float64
dtypes: float64(65), int64(41), object(16)
memory usage: 286.2+ MB
None
---------------------------------------------------------------------------
Shape of the df application_train is (307511, 122)
---------------------------------------------------------------------------
Statistical summary of application_train is :
---------------------------------------------------------------------------
Description of the df application_train:
(Output truncated: the `describe()` table for the 106 numeric columns of application_train, showing the count, mean, std, min, and 25% rows.)
| 50% | 278202.00 | 0.00 | 0.00 | 1.471500e+05 | 513531.00 | 24903.00 | 450000.00 | 0.02 | -15750.00 | -1213.00 | -4504.00 | -3254.00 | 9.00 | 1.0 | 1.00 | 0.0 | 1.00 | 0.00 | 0.00 | 2.00 | 2.00 | 2.00 | 12.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.51 | 0.57 | 0.54 | 0.09 | 0.08 | 0.98 | 0.76 | 0.02 | 0.00 | 0.14 | 0.17 | 0.21 | 0.05 | 0.08 | 0.07 | 0.00 | 0.00 | 0.08 | 0.07 | 0.98 | 0.76 | 0.02 | 0.00 | 0.14 | 0.17 | 0.21 | 0.05 | 0.08 | 0.07 | 0.00 | 0.00 | 0.09 | 0.08 | 0.98 | 0.76 | 0.02 | 0.00 | 0.14 | 0.17 | 0.21 | 0.05 | 0.08 | 0.07 | 0.00 | 0.00 | 0.07 | 0.00 | 0.00 | 0.00 | 0.00 | -757.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| 75% | 367142.50 | 0.00 | 1.00 | 2.025000e+05 | 808650.00 | 34596.00 | 679500.00 | 0.03 | -12413.00 | -289.00 | -2010.00 | -1720.00 | 15.00 | 1.0 | 1.00 | 0.0 | 1.00 | 1.00 | 0.00 | 3.00 | 2.00 | 2.00 | 14.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.68 | 0.66 | 0.67 | 0.15 | 0.11 | 0.99 | 0.82 | 0.05 | 0.12 | 0.21 | 0.33 | 0.38 | 0.09 | 0.12 | 0.13 | 0.00 | 0.03 | 0.14 | 0.11 | 0.99 | 0.82 | 0.05 | 0.12 | 0.21 | 0.33 | 0.38 | 0.08 | 0.13 | 0.13 | 0.00 | 0.02 | 0.15 | 0.11 | 0.99 | 0.83 | 0.05 | 0.12 | 0.21 | 0.33 | 0.38 | 0.09 | 0.12 | 0.13 | 0.00 | 0.03 | 0.13 | 2.00 | 0.00 | 2.00 | 0.00 | -274.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 3.00 |
| max | 456255.00 | 1.00 | 19.00 | 1.170000e+08 | 4050000.00 | 258025.50 | 4050000.00 | 0.07 | -7489.00 | 365243.00 | 0.00 | 0.00 | 91.00 | 1.0 | 1.00 | 1.0 | 1.00 | 1.00 | 1.00 | 20.00 | 3.00 | 3.00 | 23.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.96 | 0.85 | 0.90 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 348.00 | 34.00 | 344.00 | 24.00 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.0 | 1.00 | 1.0 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 4.00 | 9.00 | 8.00 | 27.00 | 261.00 | 25.00 |
None
display_feature_info(datasets['application_train'], 'application_train')
Description of the df continued for application_train:
---------------------------------------------------------------------------
Data type value counts:
float64 65
int64 41
object 16
dtype: int64
Number of unique values per object (categorical) feature:
NAME_CONTRACT_TYPE 2
CODE_GENDER 3
FLAG_OWN_CAR 2
FLAG_OWN_REALTY 2
NAME_TYPE_SUITE 7
NAME_INCOME_TYPE 8
NAME_EDUCATION_TYPE 5
NAME_FAMILY_STATUS 6
NAME_HOUSING_TYPE 6
OCCUPATION_TYPE 18
WEEKDAY_APPR_PROCESS_START 7
ORGANIZATION_TYPE 58
FONDKAPREMONT_MODE 4
HOUSETYPE_MODE 3
WALLSMATERIAL_MODE 7
EMERGENCYSTATE_MODE 2
dtype: int64
---------------------------------------------------------------------------
Categorical and Numerical(int + float) features of application_train.
---------------------------------------------------------------------------
{'int64': Index(['SK_ID_CURR', 'TARGET', 'CNT_CHILDREN', 'DAYS_BIRTH', 'DAYS_EMPLOYED',
'DAYS_ID_PUBLISH', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE',
'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'REGION_RATING_CLIENT',
'REGION_RATING_CLIENT_W_CITY', 'HOUR_APPR_PROCESS_START',
'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION',
'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY',
'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'FLAG_DOCUMENT_2',
'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5',
'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8',
'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11',
'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14',
'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17',
'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20',
'FLAG_DOCUMENT_21'],
dtype='object')}
------------------------------
{'float64': Index(['AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE',
'REGION_POPULATION_RELATIVE', 'DAYS_REGISTRATION', 'OWN_CAR_AGE',
'CNT_FAM_MEMBERS', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3',
'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG',
'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG',
'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG',
'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG',
'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE',
'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE',
'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE',
'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE',
'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI',
'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI',
'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI',
'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI',
'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI',
'TOTALAREA_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE',
'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE',
'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE',
'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY',
'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON',
'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR'],
dtype='object')}
------------------------------
{'object': Index(['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY',
'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE',
'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE',
'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE', 'FONDKAPREMONT_MODE',
'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE'],
dtype='object')}
------------------------------
---------------------------------------------------------------------------
---------------------------------------------------------------------------
The Missing Data:
| Feature | Percent | Train Missing Count |
|---|---|---|
| COMMONAREA_MEDI | 69.87 | 214865 |
| COMMONAREA_AVG | 69.87 | 214865 |
| COMMONAREA_MODE | 69.87 | 214865 |
| NONLIVINGAPARTMENTS_MODE | 69.43 | 213514 |
| NONLIVINGAPARTMENTS_AVG | 69.43 | 213514 |
| NONLIVINGAPARTMENTS_MEDI | 69.43 | 213514 |
| FONDKAPREMONT_MODE | 68.39 | 210295 |
| LIVINGAPARTMENTS_MODE | 68.35 | 210199 |
| LIVINGAPARTMENTS_AVG | 68.35 | 210199 |
| LIVINGAPARTMENTS_MEDI | 68.35 | 210199 |
| FLOORSMIN_AVG | 67.85 | 208642 |
| FLOORSMIN_MODE | 67.85 | 208642 |
| FLOORSMIN_MEDI | 67.85 | 208642 |
| YEARS_BUILD_MEDI | 66.50 | 204488 |
| YEARS_BUILD_MODE | 66.50 | 204488 |
| YEARS_BUILD_AVG | 66.50 | 204488 |
| OWN_CAR_AGE | 65.99 | 202929 |
| LANDAREA_MEDI | 59.38 | 182590 |
| LANDAREA_MODE | 59.38 | 182590 |
| LANDAREA_AVG | 59.38 | 182590 |
| BASEMENTAREA_MEDI | 58.52 | 179943 |
| BASEMENTAREA_AVG | 58.52 | 179943 |
| BASEMENTAREA_MODE | 58.52 | 179943 |
| EXT_SOURCE_1 | 56.38 | 173378 |
| NONLIVINGAREA_MODE | 55.18 | 169682 |
| NONLIVINGAREA_AVG | 55.18 | 169682 |
| NONLIVINGAREA_MEDI | 55.18 | 169682 |
| ELEVATORS_MEDI | 53.30 | 163891 |
| ELEVATORS_AVG | 53.30 | 163891 |
| ELEVATORS_MODE | 53.30 | 163891 |
| WALLSMATERIAL_MODE | 50.84 | 156341 |
| APARTMENTS_MEDI | 50.75 | 156061 |
| APARTMENTS_AVG | 50.75 | 156061 |
| APARTMENTS_MODE | 50.75 | 156061 |
| ENTRANCES_MEDI | 50.35 | 154828 |
| ENTRANCES_AVG | 50.35 | 154828 |
| ENTRANCES_MODE | 50.35 | 154828 |
| LIVINGAREA_AVG | 50.19 | 154350 |
| LIVINGAREA_MODE | 50.19 | 154350 |
| LIVINGAREA_MEDI | 50.19 | 154350 |
| HOUSETYPE_MODE | 50.18 | 154297 |
| FLOORSMAX_MODE | 49.76 | 153020 |
| FLOORSMAX_MEDI | 49.76 | 153020 |
| FLOORSMAX_AVG | 49.76 | 153020 |
| YEARS_BEGINEXPLUATATION_MODE | 48.78 | 150007 |
| YEARS_BEGINEXPLUATATION_MEDI | 48.78 | 150007 |
| YEARS_BEGINEXPLUATATION_AVG | 48.78 | 150007 |
| TOTALAREA_MODE | 48.27 | 148431 |
| EMERGENCYSTATE_MODE | 47.40 | 145755 |
| OCCUPATION_TYPE | 31.35 | 96391 |
| EXT_SOURCE_3 | 19.83 | 60965 |
| AMT_REQ_CREDIT_BUREAU_HOUR | 13.50 | 41519 |
| AMT_REQ_CREDIT_BUREAU_DAY | 13.50 | 41519 |
| AMT_REQ_CREDIT_BUREAU_WEEK | 13.50 | 41519 |
| AMT_REQ_CREDIT_BUREAU_MON | 13.50 | 41519 |
| AMT_REQ_CREDIT_BUREAU_QRT | 13.50 | 41519 |
| AMT_REQ_CREDIT_BUREAU_YEAR | 13.50 | 41519 |
| NAME_TYPE_SUITE | 0.42 | 1292 |
| OBS_30_CNT_SOCIAL_CIRCLE | 0.33 | 1021 |
| DEF_30_CNT_SOCIAL_CIRCLE | 0.33 | 1021 |
| OBS_60_CNT_SOCIAL_CIRCLE | 0.33 | 1021 |
| DEF_60_CNT_SOCIAL_CIRCLE | 0.33 | 1021 |
| EXT_SOURCE_2 | 0.21 | 660 |
| AMT_GOODS_PRICE | 0.09 | 278 |
---------------------------------------------------------------------------
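The missing-data table above follows directly from per-column `isnull()` counts. A minimal sketch of a helper that produces the same Percent / Train Missing Count layout (the name `missing_table` is illustrative, not necessarily the notebook's actual function):

```python
import numpy as np
import pandas as pd

def missing_table(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column missing counts and percentages, sorted descending."""
    miss = df.isnull().sum()
    pct = (100 * miss / len(df)).round(2)
    table = pd.DataFrame({'Percent': pct, 'Train Missing Count': miss})
    # Keep only columns that actually have missing values
    return table[table['Train Missing Count'] > 0].sort_values('Percent', ascending=False)

# Tiny demo frame: column 'a' has 2 of 4 values missing
demo = pd.DataFrame({'a': [1, np.nan, np.nan, 4], 'b': [1, 2, 3, 4]})
print(missing_table(demo))
```

Applied to `datasets['application_train']`, this reproduces the 67-row table shown above.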
datasets["application_train"]['DAYS_EMPLOYED'].describe()
count    307511.000000
mean      63815.045904
std      141275.766519
min      -17912.000000
25%       -2760.000000
50%       -1213.000000
75%        -289.000000
max      365243.000000
Name: DAYS_EMPLOYED, dtype: float64
anom_days_employed = datasets["application_train"][datasets["application_train"]['DAYS_EMPLOYED'] == 365243]
norm_days_employed = datasets["application_train"][datasets["application_train"]['DAYS_EMPLOYED'] != 365243]
print(anom_days_employed.shape)
dr_anom = anom_days_employed['TARGET'].mean() * 100
dr_norm = norm_days_employed['TARGET'].mean() * 100
print('Default rate (Anomaly): {:.2f}'.format(dr_anom))
print('Default rate (Normal): {:.2f}'.format(dr_norm))
pct_anom_days_employed = (anom_days_employed.shape[0] / datasets["application_train"].shape[0]) * 100
print(pct_anom_days_employed)
(55374, 122)
Default rate (Anomaly): 5.40
Default rate (Normal): 8.66
18.00716071945394
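The value 365243 days (roughly 1000 years) is clearly a placeholder, and its lower default rate suggests it marks a distinct client group rather than noise. One common treatment, sketched here on a toy frame (an option for later preprocessing, not necessarily what this notebook does downstream), is to keep a boolean flag and replace the sentinel with NaN so summary statistics and plots are not distorted:

```python
import numpy as np
import pandas as pd

# Toy stand-in for datasets["application_train"] (the real frame has 307511 rows)
app = pd.DataFrame({'DAYS_EMPLOYED': [-2760, -1213, 365243, -289, 365243]})

# Keep the signal as a flag, then neutralise the sentinel itself
app['DAYS_EMPLOYED_ANOM'] = app['DAYS_EMPLOYED'] == 365243
app['DAYS_EMPLOYED'] = app['DAYS_EMPLOYED'].replace({365243: np.nan})

print(app['DAYS_EMPLOYED'].describe())  # max is now a plausible negative value
```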
plt.hist(datasets["application_train"]['OWN_CAR_AGE'].dropna(), edgecolor='k', bins=25)  # ~66% of values are NaN; plt.hist cannot handle them
plt.title('OWN CAR AGE'); plt.xlabel('Car Age (Years)'); plt.ylabel('Count');
import pandas as pd
import numpy as np
import seaborn as sns #visualisation
import matplotlib.pyplot as plt #visualisation
%matplotlib inline
sns.set(color_codes=True)
def generic_xy_boxplot(xaxisfeature, yaxisfeature, legendcategory, data, log_scale):
    # Keyword arguments are required by recent seaborn releases (positional x/y is deprecated)
    sns.boxplot(x=xaxisfeature, y=yaxisfeature, hue=legendcategory, data=data)
    plt.title('Boxplot for ' + xaxisfeature + ' with ' + yaxisfeature + ' and ' + legendcategory, fontsize=10)
    if log_scale:
        plt.yscale('log')
        plt.ylabel(f'{yaxisfeature} (log scale)')
    plt.tight_layout()
def box_plot(plots):
    number_of_subplots = len(plots)
    plt.figure(figsize=(20, 8))
    sns.set_style('whitegrid')
    for i, ele in enumerate(plots):
        plt.subplot(1, number_of_subplots, i + 1)
        plt.subplots_adjust(wspace=0.25)
        xaxisfeature, yaxisfeature, legendcategory, data, log_scale = ele
        generic_xy_boxplot(xaxisfeature, yaxisfeature, legendcategory, data, log_scale)
plots = [['NAME_CONTRACT_TYPE', 'AMT_CREDIT', 'CODE_GENDER', datasets['application_train'], False]]
box_plot(plots)
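The `log_scale` flag matters for monetary features like `AMT_CREDIT`, which span several orders of magnitude. Its effect is equivalent to the direct call below; the frame is toy data standing in for `datasets['application_train']`:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so this sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy frame with the three columns the boxplot spec above uses
df = pd.DataFrame({
    'NAME_CONTRACT_TYPE': ['Cash loans', 'Revolving loans'] * 50,
    'AMT_CREDIT': [float(v) for v in range(1, 101)],
    'CODE_GENDER': ['M'] * 50 + ['F'] * 50,
})

plt.figure(figsize=(8, 4))
sns.boxplot(x='NAME_CONTRACT_TYPE', y='AMT_CREDIT', hue='CODE_GENDER', data=df)
plt.yscale('log')  # what log_scale=True toggles inside generic_xy_boxplot
```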
display_stats(datasets['bureau'], 'bureau')
--------------------------------------------------------------------------------
bureau
--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 962890 entries, 0 to 962889
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SK_ID_CURR 962890 non-null int64
1 SK_ID_BUREAU 962890 non-null int64
2 CREDIT_ACTIVE 962890 non-null object
3 CREDIT_CURRENCY 962890 non-null object
4 DAYS_CREDIT 962889 non-null float64
5 CREDIT_DAY_OVERDUE 962889 non-null float64
6 DAYS_CREDIT_ENDDATE 903124 non-null float64
7 DAYS_ENDDATE_FACT 605644 non-null float64
8 AMT_CREDIT_MAX_OVERDUE 328576 non-null float64
9 CNT_CREDIT_PROLONG 962889 non-null float64
10 AMT_CREDIT_SUM 962883 non-null float64
11 AMT_CREDIT_SUM_DEBT 818180 non-null float64
12 AMT_CREDIT_SUM_LIMIT 628041 non-null float64
13 AMT_CREDIT_SUM_OVERDUE 962889 non-null float64
14 CREDIT_TYPE 962889 non-null object
15 DAYS_CREDIT_UPDATE 962889 non-null float64
16 AMT_ANNUITY 292988 non-null float64
dtypes: float64(12), int64(2), object(3)
memory usage: 124.9+ MB
None
---------------------------------------------------------------------------
Shape of the df bureau is (962890, 17)
---------------------------------------------------------------------------
display_feature_info(datasets['bureau'], 'bureau')
Description of the df continued for bureau:
---------------------------------------------------------------------------
Data type value counts:
float64 12
object 3
int64 2
dtype: int64
Number of unique values per object (categorical) feature:
CREDIT_ACTIVE 4
CREDIT_CURRENCY 5
CREDIT_TYPE 15
dtype: int64
---------------------------------------------------------------------------
Categorical and Numerical(int + float) features of bureau.
---------------------------------------------------------------------------
{'int64': Index(['SK_ID_CURR', 'SK_ID_BUREAU'], dtype='object')}
------------------------------
{'float64': Index(['DAYS_CREDIT', 'CREDIT_DAY_OVERDUE', 'DAYS_CREDIT_ENDDATE',
'DAYS_ENDDATE_FACT', 'AMT_CREDIT_MAX_OVERDUE', 'CNT_CREDIT_PROLONG',
'AMT_CREDIT_SUM', 'AMT_CREDIT_SUM_DEBT', 'AMT_CREDIT_SUM_LIMIT',
'AMT_CREDIT_SUM_OVERDUE', 'DAYS_CREDIT_UPDATE', 'AMT_ANNUITY'],
dtype='object')}
------------------------------
{'object': Index(['CREDIT_ACTIVE', 'CREDIT_CURRENCY', 'CREDIT_TYPE'], dtype='object')}
------------------------------
---------------------------------------------------------------------------
---------------------------------------------------------------------------
The Missing Data:
| Feature | Percent | Train Missing Count |
|---|---|---|
| AMT_ANNUITY | 69.57 | 669902 |
| AMT_CREDIT_MAX_OVERDUE | 65.88 | 634314 |
| DAYS_ENDDATE_FACT | 37.10 | 357246 |
| AMT_CREDIT_SUM_LIMIT | 34.78 | 334849 |
| AMT_CREDIT_SUM_DEBT | 15.03 | 144710 |
| DAYS_CREDIT_ENDDATE | 6.21 | 59766 |
---------------------------------------------------------------------------
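Because `bureau` holds one row per previous credit, it has to be aggregated down to one row per `SK_ID_CURR` before it can be joined onto `application_train`. A minimal sketch on toy data (the aggregate names `previous_loan_count` and `amt_credit_sum_mean` are illustrative, not columns from the competition data):

```python
import pandas as pd

# Toy bureau-like frame: client 1 has two previous credits, client 2 has one
bureau = pd.DataFrame({
    'SK_ID_CURR': [1, 1, 2],
    'SK_ID_BUREAU': [11, 12, 13],
    'AMT_CREDIT_SUM': [1000.0, 2000.0, 500.0],
})

# One row per client: count of previous loans plus mean credit amount
agg = bureau.groupby('SK_ID_CURR').agg(
    previous_loan_count=('SK_ID_BUREAU', 'count'),
    amt_credit_sum_mean=('AMT_CREDIT_SUM', 'mean'),
).reset_index()
print(agg)
```

The resulting frame can then be left-merged onto `application_train` on `SK_ID_CURR`.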
display_stats(datasets['bureau_balance'], 'bureau_balance')
--------------------------------------------------------------------------------
bureau_balance
--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27299925 entries, 0 to 27299924
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SK_ID_BUREAU 27299925 non-null int64
1 MONTHS_BALANCE 27299925 non-null int64
2 STATUS 27299925 non-null object
dtypes: int64(2), object(1)
memory usage: 624.8+ MB
None
---------------------------------------------------------------------------
Shape of the df bureau_balance is (27299925, 3)
---------------------------------------------------------------------------
| 75% | 367142.50 | 0.00 | 1.00 | 2.025000e+05 | 808650.00 | 34596.00 | 679500.00 | 0.03 | -12413.00 | -289.00 | -2010.00 | -1720.00 | 15.00 | 1.0 | 1.00 | 0.0 | 1.00 | 1.00 | 0.00 | 3.00 | 2.00 | 2.00 | 14.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.68 | 0.66 | 0.67 | 0.15 | 0.11 | 0.99 | 0.82 | 0.05 | 0.12 | 0.21 | 0.33 | 0.38 | 0.09 | 0.12 | 0.13 | 0.00 | 0.03 | 0.14 | 0.11 | 0.99 | 0.82 | 0.05 | 0.12 | 0.21 | 0.33 | 0.38 | 0.08 | 0.13 | 0.13 | 0.00 | 0.02 | 0.15 | 0.11 | 0.99 | 0.83 | 0.05 | 0.12 | 0.21 | 0.33 | 0.38 | 0.09 | 0.12 | 0.13 | 0.00 | 0.03 | 0.13 | 2.00 | 0.00 | 2.00 | 0.00 | -274.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 3.00 |
| max | 456255.00 | 1.00 | 19.00 | 1.170000e+08 | 4050000.00 | 258025.50 | 4050000.00 | 0.07 | -7489.00 | 365243.00 | 0.00 | 0.00 | 91.00 | 1.0 | 1.00 | 1.0 | 1.00 | 1.00 | 1.00 | 20.00 | 3.00 | 3.00 | 23.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.96 | 0.85 | 0.90 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 348.00 | 34.00 | 344.00 | 24.00 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.0 | 1.00 | 1.0 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 4.00 | 9.00 | 8.00 | 27.00 | 261.00 | 25.00 |
None
display_feature_info(datasets['bureau_balance'], 'bureau_balance')
Description of the df continued for bureau_balance:
---------------------------------------------------------------------------
Data type value counts:
int64 2
object 1
dtype: int64
Number of unique values in each object (categorical) column:
STATUS 8
dtype: int64
---------------------------------------------------------------------------
Categorical and numerical (int + float) features of bureau_balance:
---------------------------------------------------------------------------
{'int64': Index(['SK_ID_BUREAU', 'MONTHS_BALANCE'], dtype='object')}
------------------------------
{'object': Index(['STATUS'], dtype='object')}
------------------------------
---------------------------------------------------------------------------
---------------------------------------------------------------------------
The Missing Data:
No missing data.
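The dtype-keyed dictionaries printed above (one `Index` of column names per dtype) can be reproduced with pandas' `Index.groupby` idiom. A minimal sketch — the small frame below is a hypothetical stand-in for `datasets['bureau_balance']`, not the real data:

```python
import pandas as pd

# Hypothetical stand-in for datasets['bureau_balance']
df = pd.DataFrame({
    'SK_ID_BUREAU': [5001, 5002],
    'MONTHS_BALANCE': [-1, -2],
    'STATUS': ['C', '0'],
})

# Group column names by their dtype, mirroring the {'int64': Index([...]), ...} output
by_dtype = {str(dtype): cols for dtype, cols in df.columns.groupby(df.dtypes).items()}
print(by_dtype['int64'])   # Index(['SK_ID_BUREAU', 'MONTHS_BALANCE'], dtype='object')
print(by_dtype['object'])  # Index(['STATUS'], dtype='object')
```

`df.select_dtypes(include='object')` would give the same categorical split; the `groupby` form has the advantage of covering every dtype present in one pass.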
display_stats(datasets['credit_card_balance'], 'credit_card_balance')
--------------------------------------------------------------------------------
credit_card_balance
--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 592349 entries, 0 to 592348
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SK_ID_PREV 592349 non-null int64
1 SK_ID_CURR 592349 non-null int64
2 MONTHS_BALANCE 592349 non-null int64
3 AMT_BALANCE 592349 non-null float64
4 AMT_CREDIT_LIMIT_ACTUAL 592349 non-null int64
5 AMT_DRAWINGS_ATM_CURRENT 469240 non-null float64
6 AMT_DRAWINGS_CURRENT 592348 non-null float64
7 AMT_DRAWINGS_OTHER_CURRENT 469240 non-null float64
8 AMT_DRAWINGS_POS_CURRENT 469240 non-null float64
9 AMT_INST_MIN_REGULARITY 549688 non-null float64
10 AMT_PAYMENT_CURRENT 467989 non-null float64
11 AMT_PAYMENT_TOTAL_CURRENT 592348 non-null float64
12 AMT_RECEIVABLE_PRINCIPAL 592348 non-null float64
13 AMT_RECIVABLE 592348 non-null float64
14 AMT_TOTAL_RECEIVABLE 592348 non-null float64
15 CNT_DRAWINGS_ATM_CURRENT 469240 non-null float64
16 CNT_DRAWINGS_CURRENT 592348 non-null float64
17 CNT_DRAWINGS_OTHER_CURRENT 469240 non-null float64
18 CNT_DRAWINGS_POS_CURRENT 469240 non-null float64
19 CNT_INSTALMENT_MATURE_CUM 549688 non-null float64
20 NAME_CONTRACT_STATUS 592348 non-null object
21 SK_DPD 592348 non-null float64
22 SK_DPD_DEF 592348 non-null float64
dtypes: float64(18), int64(4), object(1)
memory usage: 103.9+ MB
None
---------------------------------------------------------------------------
Shape of the df credit_card_balance is (592349, 23)
---------------------------------------------------------------------------
Statistical summary of credit_card_balance:
---------------------------------------------------------------------------
Description of the df credit_card_balance:
display_feature_info(datasets['credit_card_balance'], 'credit_card_balance')
Description of the df continued for credit_card_balance:
---------------------------------------------------------------------------
Data type value counts:
float64 18
int64 4
object 1
dtype: int64
Number of unique values in each object (categorical) column:
NAME_CONTRACT_STATUS 7
NAME_CONTRACT_STATUS 7
dtype: int64
---------------------------------------------------------------------------
Categorical and numerical (int + float) features of credit_card_balance:
---------------------------------------------------------------------------
{'int64': Index(['SK_ID_PREV', 'SK_ID_CURR', 'MONTHS_BALANCE',
'AMT_CREDIT_LIMIT_ACTUAL'],
dtype='object')}
------------------------------
{'float64': Index(['AMT_BALANCE', 'AMT_DRAWINGS_ATM_CURRENT', 'AMT_DRAWINGS_CURRENT',
'AMT_DRAWINGS_OTHER_CURRENT', 'AMT_DRAWINGS_POS_CURRENT',
'AMT_INST_MIN_REGULARITY', 'AMT_PAYMENT_CURRENT',
'AMT_PAYMENT_TOTAL_CURRENT', 'AMT_RECEIVABLE_PRINCIPAL',
'AMT_RECIVABLE', 'AMT_TOTAL_RECEIVABLE', 'CNT_DRAWINGS_ATM_CURRENT',
'CNT_DRAWINGS_CURRENT', 'CNT_DRAWINGS_OTHER_CURRENT',
'CNT_DRAWINGS_POS_CURRENT', 'CNT_INSTALMENT_MATURE_CUM', 'SK_DPD',
'SK_DPD_DEF'],
dtype='object')}
------------------------------
{'object': Index(['NAME_CONTRACT_STATUS'], dtype='object')}
------------------------------
---------------------------------------------------------------------------
---------------------------------------------------------------------------
The Missing Data:
| Feature | Percent Missing | Missing Count |
|---|---|---|
| AMT_PAYMENT_CURRENT | 20.99 | 124360 |
| AMT_DRAWINGS_ATM_CURRENT | 20.78 | 123109 |
| CNT_DRAWINGS_POS_CURRENT | 20.78 | 123109 |
| AMT_DRAWINGS_OTHER_CURRENT | 20.78 | 123109 |
| AMT_DRAWINGS_POS_CURRENT | 20.78 | 123109 |
| CNT_DRAWINGS_OTHER_CURRENT | 20.78 | 123109 |
| CNT_DRAWINGS_ATM_CURRENT | 20.78 | 123109 |
| CNT_INSTALMENT_MATURE_CUM | 7.20 | 42661 |
| AMT_INST_MIN_REGULARITY | 7.20 | 42661 |
---------------------------------------------------------------------------
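A missing-data table like the one above (percent of rows missing plus the raw count, sorted largest first) takes only a few lines of pandas. A sketch using a small hypothetical frame in place of `datasets['credit_card_balance']`:

```python
import numpy as np
import pandas as pd

# Toy frame with deliberate gaps, standing in for credit_card_balance
df = pd.DataFrame({
    'AMT_PAYMENT_CURRENT': [100.0, np.nan, np.nan, 50.0],
    'AMT_BALANCE': [1.0, 2.0, 3.0, 4.0],
})

missing = pd.DataFrame({
    'Percent': (df.isnull().mean() * 100).round(2),  # isnull().mean() = fraction missing
    'Missing Count': df.isnull().sum(),
})
# Keep only columns that actually have gaps, largest share first
missing = missing[missing['Missing Count'] > 0].sort_values('Percent', ascending=False)
print(missing)
```

Note how the AMT_DRAWINGS_* and CNT_DRAWINGS_* columns above share an identical count (123,109), which hints that they are missing together for the same card-months.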
display_stats(datasets['installments_payments'], 'installments_payments')
--------------------------------------------------------------------------------
installments_payments
--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11383847 entries, 0 to 11383846
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SK_ID_PREV 11383847 non-null int64
1 SK_ID_CURR 11383847 non-null int64
2 NUM_INSTALMENT_VERSION 11383847 non-null float64
3 NUM_INSTALMENT_NUMBER 11383847 non-null int64
4 DAYS_INSTALMENT 11383847 non-null float64
5 DAYS_ENTRY_PAYMENT 11382343 non-null float64
6 AMT_INSTALMENT 11383847 non-null float64
7 AMT_PAYMENT 11382343 non-null float64
dtypes: float64(5), int64(3)
memory usage: 694.8 MB
None
---------------------------------------------------------------------------
Shape of the df installments_payments is (11383847, 8)
---------------------------------------------------------------------------
Statistical summary of installments_payments:
---------------------------------------------------------------------------
Description of the df installments_payments:
display_feature_info(datasets['installments_payments'], 'installments_payments')
Description of the df continued for installments_payments:
---------------------------------------------------------------------------
Data type value counts:
float64 5
int64 3
dtype: int64
Number of unique values in each object (categorical) column:
Series([], dtype: float64)
---------------------------------------------------------------------------
Categorical and numerical (int + float) features of installments_payments:
---------------------------------------------------------------------------
{'int64': Index(['SK_ID_PREV', 'SK_ID_CURR', 'NUM_INSTALMENT_NUMBER'], dtype='object')}
------------------------------
{'float64': Index(['NUM_INSTALMENT_VERSION', 'DAYS_INSTALMENT', 'DAYS_ENTRY_PAYMENT',
'AMT_INSTALMENT', 'AMT_PAYMENT'],
dtype='object')}
------------------------------
---------------------------------------------------------------------------
---------------------------------------------------------------------------
The Missing Data:
| Feature | Percent Missing | Missing Count |
|---|---|---|
| DAYS_ENTRY_PAYMENT | 0.01 | 1504 |
| AMT_PAYMENT | 0.01 | 1504 |
---------------------------------------------------------------------------
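`DAYS_ENTRY_PAYMENT` and `AMT_PAYMENT` have an identical missing count (1,504), which suggests the gaps fall on the same rows — plausibly instalments that were never paid. Identical counts alone don't prove identical rows, but comparing the missingness masks does; a sketch with hypothetical data:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for installments_payments
df = pd.DataFrame({
    'DAYS_ENTRY_PAYMENT': [-10.0, np.nan, -30.0, np.nan],
    'AMT_PAYMENT': [500.0, np.nan, 250.0, np.nan],
})

# True only if the two columns are NaN in exactly the same rows
same_rows = df['DAYS_ENTRY_PAYMENT'].isna().equals(df['AMT_PAYMENT'].isna())
print(same_rows)
```

If the masks match, the two columns can be imputed (or flagged) jointly rather than independently.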
display_stats(datasets['POS_CASH_balance'], 'POS_CASH_balance')
--------------------------------------------------------------------------------
POS_CASH_balance
--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10001358 entries, 0 to 10001357
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SK_ID_PREV 10001358 non-null int64
1 SK_ID_CURR 10001358 non-null int64
2 MONTHS_BALANCE 10001358 non-null int64
3 CNT_INSTALMENT 9975287 non-null float64
4 CNT_INSTALMENT_FUTURE 9975271 non-null float64
5 NAME_CONTRACT_STATUS 10001358 non-null object
6 SK_DPD 10001358 non-null int64
7 SK_DPD_DEF 10001358 non-null int64
dtypes: float64(2), int64(5), object(1)
memory usage: 610.4+ MB
None
---------------------------------------------------------------------------
Shape of the df POS_CASH_balance is (10001358, 8)
---------------------------------------------------------------------------
Statistical summary of POS_CASH_balance:
---------------------------------------------------------------------------
Description of the df POS_CASH_balance:
| std | 102790.18 | 0.27 | 0.72 | 2.371231e+05 | 402490.78 | 14493.74 | 369446.46 | 0.01 | 4363.99 | 141275.77 | 3522.89 | 1509.45 | 11.94 | 0.0 | 0.38 | 0.4 | 0.04 | 0.45 | 0.23 | 0.91 | 0.51 | 0.50 | 3.27 | 0.12 | 0.22 | 0.20 | 0.27 | 0.42 | 0.38 | 0.21 | 0.19 | 0.19 | 0.11 | 0.08 | 0.06 | 0.11 | 0.08 | 0.13 | 0.10 | 0.14 | 0.16 | 0.08 | 0.09 | 0.11 | 0.05 | 0.07 | 0.11 | 0.08 | 0.06 | 0.11 | 0.07 | 0.13 | 0.10 | 0.14 | 0.16 | 0.08 | 0.10 | 0.11 | 0.05 | 0.07 | 0.11 | 0.08 | 0.06 | 0.11 | 0.08 | 0.13 | 0.10 | 0.15 | 0.16 | 0.08 | 0.09 | 0.11 | 0.05 | 0.07 | 0.11 | 2.40 | 0.45 | 2.38 | 0.36 | 826.81 | 0.01 | 0.45 | 0.01 | 0.12 | 0.28 | 0.01 | 0.27 | 0.06 | 0.0 | 0.06 | 0.0 | 0.06 | 0.05 | 0.03 | 0.10 | 0.02 | 0.09 | 0.02 | 0.02 | 0.02 | 0.08 | 0.11 | 0.20 | 0.92 | 0.79 | 1.87 |
| min | 100002.00 | 0.00 | 0.00 | 2.565000e+04 | 45000.00 | 1615.50 | 40500.00 | 0.00 | -25229.00 | -17912.00 | -24672.00 | -7197.00 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | -4292.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 25% | 189145.50 | 0.00 | 0.00 | 1.125000e+05 | 270000.00 | 16524.00 | 238500.00 | 0.01 | -19682.00 | -2760.00 | -7479.50 | -4299.00 | 5.00 | 1.0 | 1.00 | 0.0 | 1.00 | 0.00 | 0.00 | 2.00 | 2.00 | 2.00 | 10.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.33 | 0.39 | 0.37 | 0.06 | 0.04 | 0.98 | 0.69 | 0.01 | 0.00 | 0.07 | 0.17 | 0.08 | 0.02 | 0.05 | 0.05 | 0.00 | 0.00 | 0.05 | 0.04 | 0.98 | 0.70 | 0.01 | 0.00 | 0.07 | 0.17 | 0.08 | 0.02 | 0.05 | 0.04 | 0.00 | 0.00 | 0.06 | 0.04 | 0.98 | 0.69 | 0.01 | 0.00 | 0.07 | 0.17 | 0.08 | 0.02 | 0.05 | 0.05 | 0.00 | 0.00 | 0.04 | 0.00 | 0.00 | 0.00 | 0.00 | -1570.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 50% | 278202.00 | 0.00 | 0.00 | 1.471500e+05 | 513531.00 | 24903.00 | 450000.00 | 0.02 | -15750.00 | -1213.00 | -4504.00 | -3254.00 | 9.00 | 1.0 | 1.00 | 0.0 | 1.00 | 0.00 | 0.00 | 2.00 | 2.00 | 2.00 | 12.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.51 | 0.57 | 0.54 | 0.09 | 0.08 | 0.98 | 0.76 | 0.02 | 0.00 | 0.14 | 0.17 | 0.21 | 0.05 | 0.08 | 0.07 | 0.00 | 0.00 | 0.08 | 0.07 | 0.98 | 0.76 | 0.02 | 0.00 | 0.14 | 0.17 | 0.21 | 0.05 | 0.08 | 0.07 | 0.00 | 0.00 | 0.09 | 0.08 | 0.98 | 0.76 | 0.02 | 0.00 | 0.14 | 0.17 | 0.21 | 0.05 | 0.08 | 0.07 | 0.00 | 0.00 | 0.07 | 0.00 | 0.00 | 0.00 | 0.00 | -757.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| 75% | 367142.50 | 0.00 | 1.00 | 2.025000e+05 | 808650.00 | 34596.00 | 679500.00 | 0.03 | -12413.00 | -289.00 | -2010.00 | -1720.00 | 15.00 | 1.0 | 1.00 | 0.0 | 1.00 | 1.00 | 0.00 | 3.00 | 2.00 | 2.00 | 14.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.68 | 0.66 | 0.67 | 0.15 | 0.11 | 0.99 | 0.82 | 0.05 | 0.12 | 0.21 | 0.33 | 0.38 | 0.09 | 0.12 | 0.13 | 0.00 | 0.03 | 0.14 | 0.11 | 0.99 | 0.82 | 0.05 | 0.12 | 0.21 | 0.33 | 0.38 | 0.08 | 0.13 | 0.13 | 0.00 | 0.02 | 0.15 | 0.11 | 0.99 | 0.83 | 0.05 | 0.12 | 0.21 | 0.33 | 0.38 | 0.09 | 0.12 | 0.13 | 0.00 | 0.03 | 0.13 | 2.00 | 0.00 | 2.00 | 0.00 | -274.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 3.00 |
| max | 456255.00 | 1.00 | 19.00 | 1.170000e+08 | 4050000.00 | 258025.50 | 4050000.00 | 0.07 | -7489.00 | 365243.00 | 0.00 | 0.00 | 91.00 | 1.0 | 1.00 | 1.0 | 1.00 | 1.00 | 1.00 | 20.00 | 3.00 | 3.00 | 23.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.96 | 0.85 | 0.90 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 348.00 | 34.00 | 344.00 | 24.00 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.0 | 1.00 | 1.0 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 4.00 | 9.00 | 8.00 | 27.00 | 261.00 | 25.00 |
None
display_feature_info(datasets['POS_CASH_balance'], 'POS_CASH_balance')
Description of the df continued for POS_CASH_balance:
---------------------------------------------------------------------------
Data type value counts:
int64 5
float64 2
object 1
dtype: int64
Return number of unique elements in the object.
NAME_CONTRACT_STATUS 9
dtype: int64
---------------------------------------------------------------------------
Categorical and Numerical(int + float) features of POS_CASH_balance.
---------------------------------------------------------------------------
{'int64': Index(['SK_ID_PREV', 'SK_ID_CURR', 'MONTHS_BALANCE', 'SK_DPD', 'SK_DPD_DEF'], dtype='object')}
------------------------------
{'float64': Index(['CNT_INSTALMENT', 'CNT_INSTALMENT_FUTURE'], dtype='object')}
------------------------------
{'object': Index(['NAME_CONTRACT_STATUS'], dtype='object')}
------------------------------
---------------------------------------------------------------------------
---------------------------------------------------------------------------
The Missing Data:
| Percent | Train Missing Count | |
|---|---|---|
| CNT_INSTALMENT_FUTURE | 0.26 | 26087 |
| CNT_INSTALMENT | 0.26 | 26071 |
---------------------------------------------------------------------------
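The dtype breakdown shown above (int64 / float64 / object column groups) can be reproduced with a small helper. This is a sketch, not the notebook's own `display_feature_info`; the toy DataFrame below just mimics a few POS_CASH_balance columns.

```python
import pandas as pd

def split_features_by_dtype(df: pd.DataFrame) -> dict:
    """Map each dtype name to the list of columns with that dtype."""
    out = {}
    for col in df.columns:
        out.setdefault(str(df[col].dtype), []).append(col)
    return out

# Toy frame with one column of each dtype seen in POS_CASH_balance
df = pd.DataFrame({
    "SK_ID_PREV": [1, 2],
    "CNT_INSTALMENT": [12.0, 24.0],
    "NAME_CONTRACT_STATUS": ["Active", "Completed"],
})
split_features_by_dtype(df)
```

The same split can also be done with `df.select_dtypes(include=...)` per dtype; the loop above just keeps all groups in one pass.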
# Percentage and count of missing values per column, sorted descending
percent = (datasets["application_train"].isnull().sum()/datasets["application_train"].isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = datasets["application_train"].isna().sum().sort_values(ascending = False)
missing_application_train_data = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Train Missing Count"])
missing_application_train_data.head(20)
| Percent | Train Missing Count | |
|---|---|---|
| COMMONAREA_MEDI | 69.87 | 214865 |
| COMMONAREA_AVG | 69.87 | 214865 |
| COMMONAREA_MODE | 69.87 | 214865 |
| NONLIVINGAPARTMENTS_MODE | 69.43 | 213514 |
| NONLIVINGAPARTMENTS_AVG | 69.43 | 213514 |
| NONLIVINGAPARTMENTS_MEDI | 69.43 | 213514 |
| FONDKAPREMONT_MODE | 68.39 | 210295 |
| LIVINGAPARTMENTS_MODE | 68.35 | 210199 |
| LIVINGAPARTMENTS_AVG | 68.35 | 210199 |
| LIVINGAPARTMENTS_MEDI | 68.35 | 210199 |
| FLOORSMIN_AVG | 67.85 | 208642 |
| FLOORSMIN_MODE | 67.85 | 208642 |
| FLOORSMIN_MEDI | 67.85 | 208642 |
| YEARS_BUILD_MEDI | 66.50 | 204488 |
| YEARS_BUILD_MODE | 66.50 | 204488 |
| YEARS_BUILD_AVG | 66.50 | 204488 |
| OWN_CAR_AGE | 65.99 | 202929 |
| LANDAREA_MEDI | 59.38 | 182590 |
| LANDAREA_MODE | 59.38 | 182590 |
| LANDAREA_AVG | 59.38 | 182590 |
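Since the same missing-value computation is repeated for the train and test frames, it can be factored into a small reusable function. This is a hedged sketch (the function name and `count_label` parameter are our own, not part of the notebook's helpers):

```python
import pandas as pd

def missing_summary(df: pd.DataFrame, count_label: str = "Missing Count") -> pd.DataFrame:
    """Percent and count of missing values per column, sorted descending."""
    percent = (df.isnull().mean() * 100).round(2)   # mean of a boolean mask = fraction missing
    counts = df.isnull().sum()
    out = pd.concat([percent, counts], axis=1, keys=["Percent", count_label])
    return out.sort_values("Percent", ascending=False)

# Tiny example: column 'a' is 75% missing, column 'b' 25%
df = pd.DataFrame({"a": [1, None, None, None], "b": [1, 2, 3, None]})
missing_summary(df)
```

With this helper, `missing_summary(datasets["application_train"], "Train Missing Count").head(20)` would reproduce the table above.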
# Same computation for the test set
percent = (datasets["application_test"].isnull().sum()/datasets["application_test"].isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = datasets["application_test"].isna().sum().sort_values(ascending = False)
missing_application_test_data = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Test Missing Count"])
missing_application_test_data.head(20)
| Percent | Test Missing Count | |
|---|---|---|
| COMMONAREA_AVG | 68.72 | 33495 |
| COMMONAREA_MODE | 68.72 | 33495 |
| COMMONAREA_MEDI | 68.72 | 33495 |
| NONLIVINGAPARTMENTS_AVG | 68.41 | 33347 |
| NONLIVINGAPARTMENTS_MODE | 68.41 | 33347 |
| NONLIVINGAPARTMENTS_MEDI | 68.41 | 33347 |
| FONDKAPREMONT_MODE | 67.28 | 32797 |
| LIVINGAPARTMENTS_AVG | 67.25 | 32780 |
| LIVINGAPARTMENTS_MODE | 67.25 | 32780 |
| LIVINGAPARTMENTS_MEDI | 67.25 | 32780 |
| FLOORSMIN_MEDI | 66.61 | 32466 |
| FLOORSMIN_AVG | 66.61 | 32466 |
| FLOORSMIN_MODE | 66.61 | 32466 |
| OWN_CAR_AGE | 66.29 | 32312 |
| YEARS_BUILD_AVG | 65.28 | 31818 |
| YEARS_BUILD_MEDI | 65.28 | 31818 |
| YEARS_BUILD_MODE | 65.28 | 31818 |
| LANDAREA_MEDI | 57.96 | 28254 |
| LANDAREA_AVG | 57.96 | 28254 |
| LANDAREA_MODE | 57.96 | 28254 |
datasets["application_train"]['TARGET'].astype(int).plot.hist();
correlations = datasets["application_train"].corr()['TARGET'].sort_values()
print('Top 10 most Positive Correlations:\n', correlations.tail(10))
print('\nTop 10 most Negative Correlations:\n', correlations.head(10))
Top 10 most Positive Correlations:
 FLAG_DOCUMENT_3                0.044346
 REG_CITY_NOT_LIVE_CITY         0.044395
 FLAG_EMP_PHONE                 0.045982
 REG_CITY_NOT_WORK_CITY         0.050994
 DAYS_ID_PUBLISH                0.051457
 DAYS_LAST_PHONE_CHANGE         0.055218
 REGION_RATING_CLIENT           0.058899
 REGION_RATING_CLIENT_W_CITY    0.060893
 DAYS_BIRTH                     0.078239
 TARGET                         1.000000
Name: TARGET, dtype: float64

Top 10 most Negative Correlations:
 EXT_SOURCE_3                  -0.178919
 EXT_SOURCE_2                  -0.160472
 EXT_SOURCE_1                  -0.155317
 DAYS_EMPLOYED                 -0.044932
 FLOORSMAX_AVG                 -0.044003
 FLOORSMAX_MEDI                -0.043768
 FLOORSMAX_MODE                -0.043226
 AMT_GOODS_PRICE               -0.039645
 REGION_POPULATION_RELATIVE    -0.037227
 ELEVATORS_AVG                 -0.034199
Name: TARGET, dtype: float64
num_attribs = ['TARGET', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'DAYS_EMPLOYED',
'DAYS_BIRTH', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'AMT_GOODS_PRICE']
df = datasets["application_train"].copy()
df2 = df[num_attribs]
corr = df2.corr()
corr.style.background_gradient(cmap='PuBu').set_precision(2)
| TARGET | AMT_INCOME_TOTAL | AMT_CREDIT | DAYS_EMPLOYED | DAYS_BIRTH | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | AMT_GOODS_PRICE | |
|---|---|---|---|---|---|---|---|---|---|
| TARGET | 1.00 | -0.00 | -0.03 | -0.04 | 0.08 | -0.16 | -0.16 | -0.18 | -0.04 |
| AMT_INCOME_TOTAL | -0.00 | 1.00 | 0.16 | -0.06 | 0.03 | 0.03 | 0.06 | -0.03 | 0.16 |
| AMT_CREDIT | -0.03 | 0.16 | 1.00 | -0.07 | -0.06 | 0.17 | 0.13 | 0.04 | 0.99 |
| DAYS_EMPLOYED | -0.04 | -0.06 | -0.07 | 1.00 | -0.62 | 0.29 | -0.02 | 0.11 | -0.06 |
| DAYS_BIRTH | 0.08 | 0.03 | -0.06 | -0.62 | 1.00 | -0.60 | -0.09 | -0.21 | -0.05 |
| EXT_SOURCE_1 | -0.16 | 0.03 | 0.17 | 0.29 | -0.60 | 1.00 | 0.21 | 0.19 | 0.18 |
| EXT_SOURCE_2 | -0.16 | 0.06 | 0.13 | -0.02 | -0.09 | 0.21 | 1.00 | 0.11 | 0.14 |
| EXT_SOURCE_3 | -0.18 | -0.03 | 0.04 | 0.11 | -0.21 | 0.19 | 0.11 | 1.00 | 0.05 |
| AMT_GOODS_PRICE | -0.04 | 0.16 | 0.99 | -0.06 | -0.05 | 0.18 | 0.14 | 0.05 | 1.00 |
def numerical_features_plot(datasets, df_name):
    df = datasets[df_name].copy()
    # Select numeric columns while TARGET is still numeric, so it is kept
    numerical_col = [col for col in df
                     if df[col].dtype == 'int64' or df[col].dtype == 'float64']
    print(numerical_col)
    print(len(numerical_col))
    df2 = df[numerical_col].copy()  # copy to avoid SettingWithCopyWarning
    df2.fillna(0, inplace=True)
    # Scatter-matrix, colored by the (still numeric) target
    grr = pd.plotting.scatter_matrix(df2.loc[:, df2.columns != 'TARGET'],
                                     c=df2['TARGET'], figsize=(15, 15), marker='.',
                                     hist_kwds={'bins': 10}, s=60, alpha=.2)
    # Pair-plot with human-readable target labels
    df2['TARGET'] = df2['TARGET'].replace({0: "No Default", 1: "Default"})
    num_sns = sns.pairplot(df2, hue="TARGET", markers=["s", "o"])
run = True
if run:
    df_name = 'application_train'
    num_attribs = ['TARGET', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'DAYS_EMPLOYED',
                   'DAYS_BIRTH', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'AMT_GOODS_PRICE']
    df = datasets[df_name].copy()
    df2 = df[num_attribs].copy()  # copy to avoid SettingWithCopyWarning
    df2.fillna(0, inplace=True)
    # Scatter-matrix, colored by the numeric target
    grr = pd.plotting.scatter_matrix(df2.loc[:, df2.columns != 'TARGET'],
                                     c=datasets[df_name]['TARGET'],
                                     figsize=(15, 15), marker='.',
                                     hist_kwds={'bins': 10}, s=60, alpha=.2)
    # Pair-plot with human-readable target labels
    df2['TARGET'] = df2['TARGET'].replace({0: "No Default", 1: "Default"})
    num_sns = sns.pairplot(df2, hue="TARGET", markers=["s", "o"])
# num_sns.title("Numerical variables - Pair-Plot")
import matplotlib.pyplot as plt
plt.hist(datasets["application_train"]['DAYS_BIRTH'] / -365, edgecolor = 'k', bins = 25)
plt.title('Age of Client'); plt.xlabel('Age (years)'); plt.ylabel('Count');
sns.countplot(x='OCCUPATION_TYPE', data=datasets["application_train"]);
plt.title('Applicants Occupation');
plt.xticks(rotation=90);
datasets.keys()
dict_keys(['application_test', 'application_train', 'bureau', 'bureau_balance', 'credit_card_balance', 'installments_payments', 'previous_application', 'POS_CASH_balance'])
len(datasets["application_train"]["SK_ID_CURR"].unique()) == datasets["application_train"].shape[0]
True
np.intersect1d(datasets["application_train"]["SK_ID_CURR"], datasets["application_test"]["SK_ID_CURR"])
array([], dtype=int64)
datasets["application_test"].shape
(3850, 121)
datasets["application_train"].shape
(307511, 122)
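The two sanity checks above (SK_ID_CURR is unique within train, and train and test clients are disjoint) can be bundled into one helper. A minimal sketch, with a made-up function name and toy IDs:

```python
import numpy as np
import pandas as pd

def check_keys(train_ids, test_ids) -> dict:
    """Verify the primary key is unique in train and disjoint from test."""
    train_ids = pd.Series(train_ids)
    return {
        "train_unique": train_ids.nunique() == len(train_ids),
        "overlap": len(np.intersect1d(train_ids, test_ids)),
    }

# Toy IDs in the style of SK_ID_CURR
result = check_keys([100002, 100003, 100004], [100005, 100006])
```

Run against the real frames, `check_keys(datasets["application_train"]["SK_ID_CURR"], datasets["application_test"]["SK_ID_CURR"])` should report `train_unique: True` and `overlap: 0`, matching the outputs above.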
Most of the clients in the Kaggle submission file appear in previous_application.csv: 47,800 out of 48,744 people have at least one previous application.
display_stats(datasets['previous_application'], 'previous_application')
--------------------------------------------------------------------------------
previous_application
--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46986 entries, 0 to 46985
Data columns (total 37 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SK_ID_PREV 46986 non-null int64
1 SK_ID_CURR 46986 non-null int64
2 NAME_CONTRACT_TYPE 46986 non-null object
3 AMT_ANNUITY 36985 non-null float64
4 AMT_APPLICATION 46986 non-null float64
5 AMT_CREDIT 46986 non-null float64
6 AMT_DOWN_PAYMENT 23126 non-null float64
7 AMT_GOODS_PRICE 36850 non-null float64
8 WEEKDAY_APPR_PROCESS_START 46986 non-null object
9 HOUR_APPR_PROCESS_START 46986 non-null int64
10 FLAG_LAST_APPL_PER_CONTRACT 46986 non-null object
11 NFLAG_LAST_APPL_IN_DAY 46986 non-null int64
12 RATE_DOWN_PAYMENT 23126 non-null float64
13 RATE_INTEREST_PRIMARY 153 non-null float64
14 RATE_INTEREST_PRIVILEGED 153 non-null float64
15 NAME_CASH_LOAN_PURPOSE 46986 non-null object
16 NAME_CONTRACT_STATUS 46985 non-null object
17 DAYS_DECISION 46985 non-null float64
18 NAME_PAYMENT_TYPE 46985 non-null object
19 CODE_REJECT_REASON 46985 non-null object
20 NAME_TYPE_SUITE 24167 non-null object
21 NAME_CLIENT_TYPE 46985 non-null object
22 NAME_GOODS_CATEGORY 46985 non-null object
23 NAME_PORTFOLIO 46985 non-null object
24 NAME_PRODUCT_TYPE 46985 non-null object
25 CHANNEL_TYPE 46985 non-null object
26 SELLERPLACE_AREA 46985 non-null float64
27 NAME_SELLER_INDUSTRY 46985 non-null object
28 CNT_PAYMENT 36984 non-null float64
29 NAME_YIELD_GROUP 46985 non-null object
30 PRODUCT_COMBINATION 46977 non-null object
31 DAYS_FIRST_DRAWING 28881 non-null float64
32 DAYS_FIRST_DUE 28881 non-null float64
33 DAYS_LAST_DUE_1ST_VERSION 28881 non-null float64
34 DAYS_LAST_DUE 28881 non-null float64
35 DAYS_TERMINATION 28881 non-null float64
36 NFLAG_INSURED_ON_APPROVAL 28881 non-null float64
dtypes: float64(17), int64(4), object(16)
memory usage: 13.3+ MB
None
---------------------------------------------------------------------------
Shape of the df previous_application is (46986, 37)
---------------------------------------------------------------------------
Statistical summary of previous_application is :
---------------------------------------------------------------------------
Description of the df previous_application:
None
display_feature_info(datasets['previous_application'], 'previous_application')
Description of the df continued for previous_application:
---------------------------------------------------------------------------
Data type value counts:
float64 17
object 16
int64 4
dtype: int64
Return number of unique elements in the object.
NAME_CONTRACT_TYPE 4
WEEKDAY_APPR_PROCESS_START 7
FLAG_LAST_APPL_PER_CONTRACT 2
NAME_CASH_LOAN_PURPOSE 25
NAME_CONTRACT_STATUS 4
NAME_PAYMENT_TYPE 4
CODE_REJECT_REASON 9
NAME_TYPE_SUITE 7
NAME_CLIENT_TYPE 4
NAME_GOODS_CATEGORY 26
NAME_PORTFOLIO 5
NAME_PRODUCT_TYPE 3
CHANNEL_TYPE 8
NAME_SELLER_INDUSTRY 11
NAME_YIELD_GROUP 5
PRODUCT_COMBINATION 17
dtype: int64
---------------------------------------------------------------------------
Categorical and Numerical(int + float) features of previous_application.
---------------------------------------------------------------------------
{'int64': Index(['SK_ID_PREV', 'SK_ID_CURR', 'HOUR_APPR_PROCESS_START',
'NFLAG_LAST_APPL_IN_DAY'],
dtype='object')}
------------------------------
{'float64': Index(['AMT_ANNUITY', 'AMT_APPLICATION', 'AMT_CREDIT', 'AMT_DOWN_PAYMENT',
'AMT_GOODS_PRICE', 'RATE_DOWN_PAYMENT', 'RATE_INTEREST_PRIMARY',
'RATE_INTEREST_PRIVILEGED', 'DAYS_DECISION', 'SELLERPLACE_AREA',
'CNT_PAYMENT', 'DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE',
'DAYS_LAST_DUE_1ST_VERSION', 'DAYS_LAST_DUE', 'DAYS_TERMINATION',
'NFLAG_INSURED_ON_APPROVAL'],
dtype='object')}
------------------------------
{'object': Index(['NAME_CONTRACT_TYPE', 'WEEKDAY_APPR_PROCESS_START',
'FLAG_LAST_APPL_PER_CONTRACT', 'NAME_CASH_LOAN_PURPOSE',
'NAME_CONTRACT_STATUS', 'NAME_PAYMENT_TYPE', 'CODE_REJECT_REASON',
'NAME_TYPE_SUITE', 'NAME_CLIENT_TYPE', 'NAME_GOODS_CATEGORY',
'NAME_PORTFOLIO', 'NAME_PRODUCT_TYPE', 'CHANNEL_TYPE',
'NAME_SELLER_INDUSTRY', 'NAME_YIELD_GROUP', 'PRODUCT_COMBINATION'],
dtype='object')}
------------------------------
---------------------------------------------------------------------------
---------------------------------------------------------------------------
The Missing Data:
| Percent | Train Missing Count | |
|---|---|---|
| RATE_INTEREST_PRIVILEGED | 99.67 | 46833 |
| RATE_INTEREST_PRIMARY | 99.67 | 46833 |
| RATE_DOWN_PAYMENT | 50.78 | 23860 |
| AMT_DOWN_PAYMENT | 50.78 | 23860 |
| NAME_TYPE_SUITE | 48.57 | 22819 |
| NFLAG_INSURED_ON_APPROVAL | 38.53 | 18105 |
| DAYS_FIRST_DRAWING | 38.53 | 18105 |
| DAYS_FIRST_DUE | 38.53 | 18105 |
| DAYS_LAST_DUE_1ST_VERSION | 38.53 | 18105 |
| DAYS_LAST_DUE | 38.53 | 18105 |
| DAYS_TERMINATION | 38.53 | 18105 |
| AMT_GOODS_PRICE | 21.57 | 10136 |
| CNT_PAYMENT | 21.29 | 10002 |
| AMT_ANNUITY | 21.29 | 10001 |
| PRODUCT_COMBINATION | 0.02 | 9 |
---------------------------------------------------------------------------
appsDF = datasets["previous_application"]
len(np.intersect1d(datasets["previous_application"]["SK_ID_CURR"], datasets["application_test"]["SK_ID_CURR"]))
481
print(f"There are {appsDF.shape[0]:,} previous applications")
There are 46,986 previous applications
# How many previous applications are there per client?
prevAppCounts = appsDF['SK_ID_CURR'].value_counts(dropna=False)
len(prevAppCounts[prevAppCounts > 40]) # more than 40 previous applications
0
sum(appsDF['SK_ID_CURR'].value_counts()==1)
37692
plt.hist(appsDF['SK_ID_CURR'].value_counts(), cumulative =True, bins = 100);
plt.grid()
plt.ylabel('cumulative number of IDs')
plt.xlabel('Number of previous applications per ID')
plt.title('Histogram of Number of previous applications for an ID')
Text(0.5, 1.0, 'Histogram of Number of previous applications for an ID')
* Low = fewer than 5 previous applications (~66.5%)
* Medium = 5 to 39 previous applications (~33.5%)
* High = 40 or more previous applications (~0.01%)
apps_all = appsDF['SK_ID_CURR'].nunique()
apps_5plus = appsDF['SK_ID_CURR'].value_counts()>=5
apps_40plus = appsDF['SK_ID_CURR'].value_counts()>=40
print('Percentage with 5 or more previous apps:', np.round(100.*(sum(apps_5plus)/apps_all),5))
print('Percentage with 40 or more previous apps:', np.round(100.*(sum(apps_40plus)/apps_all),5))
Percentage with 5 or more previous apps: 33.49588 Percentage with 40 or more previous apps: 0.0114
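The banding above can be reproduced directly from the per-client counts with `pd.cut`; a minimal sketch on toy counts (the bin edges mirror the bands, the count values are invented):

```python
import pandas as pd

# Toy per-client counts of previous applications (invented values).
counts = pd.Series([1, 1, 2, 4, 6, 12, 45], name="n_prev_apps")

# Band clients into low (<5), medium (5-39), and high (40+) activity.
bands = pd.cut(counts, bins=[0, 4, 39, float("inf")],
               labels=["low", "medium", "high"])

print(bands.value_counts())
```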
Feature engineering is the process of selecting, manipulating, and transforming raw data into features that can be used for classification. It produces new features for both supervised and unsupervised learning, with the goal of simplifying and speeding up data transformations while also improving model accuracy. A poorly constructed feature can directly hurt a model, so feature engineering is a key step in any machine learning project.
For HCDR as well, feature engineering turns out to be a game changer. Features drawn from the various datasets may or may not influence the target variable, so it is important to create feature families and experiment with different model settings to obtain an accurate classifier.
Feature engineering includes:

* Feature creation
* Feature transformation
* Feature extraction
* Feature aggregation

In the case of the HCDR competition (and many other machine learning problems that involve multiple tables, in 3NF or not) we need to join (denormalize) these datasets when using a machine learning pipeline. Joining the secondary tables with the primary table yields many new features about each loan application; these tend to be aggregate-type features or metadata about the loan or its application. How can we do this when using machine learning pipelines?
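The denormalization just described boils down to a groupby-aggregate on the secondary table followed by a left merge onto the primary table. A toy sketch (column names mirror the HCDR files, the rows are invented):

```python
import pandas as pd

# Toy primary table (one row per current application) and secondary table
# (many rows per client), standing in for application_train and
# previous_application.
primary = pd.DataFrame({"SK_ID_CURR": [1, 2, 3]})
secondary = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "AMT_APPLICATION": [100.0, 300.0, 50.0],
})

# Collapse the secondary table to one row per client...
aggs = (secondary.groupby("SK_ID_CURR")["AMT_APPLICATION"]
        .agg(["mean", "min", "max"])
        .add_prefix("prev_AMT_APPLICATION_")
        .reset_index())

# ...then left-join onto the primary table; clients with no history get NaN,
# which the downstream pipeline must impute.
merged = primary.merge(aggs, how="left", on="SK_ID_CURR")
print(merged)
```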
Joining previous_application with application_x

We refer to the application_train data (and the application_test data) as the primary table and the other files (e.g., the previous_application dataset) as secondary tables. The previous_application table can be joined to the primary table on the key SK_ID_CURR.
Let's assume we wish to generate features based on previous application attempts. Possible features here include:

* Aggregates of AMT_APPLICATION and AMT_CREDIT, based on the average, min, max, median, etc.

To build such features, we need to join the application_train data (and the application_test data) with the previous_application dataset (and the other available datasets).

When joining this data in the context of pipelines, different strategies come to mind, with various tradeoffs:

1. Join the secondary tables with the application_train data (the labeled dataset) and with the application_test data (the unlabeled submission dataset) prior to processing the data (in a train, valid, test partition) via your machine learning pipeline. [This approach is recommended for this HCDR competition. WHY? Think about this and build on it.]
2. Join the secondary tables after partitioning the application_train data (the labeled dataset) and the application_test data (the unlabeled submission dataset), thereby leading to X_train, y_train, X_valid, etc.

appsDF.columns
Index(['SK_ID_PREV', 'SK_ID_CURR', 'NAME_CONTRACT_TYPE', 'AMT_ANNUITY',
'AMT_APPLICATION', 'AMT_CREDIT', 'AMT_DOWN_PAYMENT', 'AMT_GOODS_PRICE',
'WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START',
'FLAG_LAST_APPL_PER_CONTRACT', 'NFLAG_LAST_APPL_IN_DAY',
'RATE_DOWN_PAYMENT', 'RATE_INTEREST_PRIMARY',
'RATE_INTEREST_PRIVILEGED', 'NAME_CASH_LOAN_PURPOSE',
'NAME_CONTRACT_STATUS', 'DAYS_DECISION', 'NAME_PAYMENT_TYPE',
'CODE_REJECT_REASON', 'NAME_TYPE_SUITE', 'NAME_CLIENT_TYPE',
'NAME_GOODS_CATEGORY', 'NAME_PORTFOLIO', 'NAME_PRODUCT_TYPE',
'CHANNEL_TYPE', 'SELLERPLACE_AREA', 'NAME_SELLER_INDUSTRY',
'CNT_PAYMENT', 'NAME_YIELD_GROUP', 'PRODUCT_COMBINATION',
'DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE', 'DAYS_LAST_DUE_1ST_VERSION',
'DAYS_LAST_DUE', 'DAYS_TERMINATION', 'NFLAG_INSURED_ON_APPROVAL'],
dtype='object')
num_attributes = appsDF.select_dtypes(include=['int64', 'float64']).columns
num_attributes
Index(['SK_ID_PREV', 'SK_ID_CURR', 'AMT_ANNUITY', 'AMT_APPLICATION',
'AMT_CREDIT', 'AMT_DOWN_PAYMENT', 'AMT_GOODS_PRICE',
'HOUR_APPR_PROCESS_START', 'NFLAG_LAST_APPL_IN_DAY',
'RATE_DOWN_PAYMENT', 'RATE_INTEREST_PRIMARY',
'RATE_INTEREST_PRIVILEGED', 'DAYS_DECISION', 'SELLERPLACE_AREA',
'CNT_PAYMENT', 'DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE',
'DAYS_LAST_DUE_1ST_VERSION', 'DAYS_LAST_DUE', 'DAYS_TERMINATION',
'NFLAG_INSURED_ON_APPROVAL'],
dtype='object')
cat_attributes = appsDF.select_dtypes(exclude=['int64', 'float64']).columns
cat_attributes
Index(['NAME_CONTRACT_TYPE', 'WEEKDAY_APPR_PROCESS_START',
'FLAG_LAST_APPL_PER_CONTRACT', 'NAME_CASH_LOAN_PURPOSE',
'NAME_CONTRACT_STATUS', 'NAME_PAYMENT_TYPE', 'CODE_REJECT_REASON',
'NAME_TYPE_SUITE', 'NAME_CLIENT_TYPE', 'NAME_GOODS_CATEGORY',
'NAME_PORTFOLIO', 'NAME_PRODUCT_TYPE', 'CHANNEL_TYPE',
'NAME_SELLER_INDUSTRY', 'NAME_YIELD_GROUP', 'PRODUCT_COMBINATION'],
dtype='object')
print('----'*15)
print('Total number of features in feature set 1 - ',(len(num_attributes) + len(cat_attributes)))
print('----'*15)
print('Number of numerical attributes - ',len(num_attributes))
print('Number of categorical attributes - ',len(cat_attributes))
------------------------------------------------------------ Total number of features in feature set 1 - 37 ------------------------------------------------------------ Number of numerical attributes - 21 Number of categorical attributes - 16
appsDF.isna().sum()
SK_ID_PREV 0 SK_ID_CURR 0 NAME_CONTRACT_TYPE 0 AMT_ANNUITY 10001 AMT_APPLICATION 0 AMT_CREDIT 0 AMT_DOWN_PAYMENT 23860 AMT_GOODS_PRICE 10136 WEEKDAY_APPR_PROCESS_START 0 HOUR_APPR_PROCESS_START 0 FLAG_LAST_APPL_PER_CONTRACT 0 NFLAG_LAST_APPL_IN_DAY 0 RATE_DOWN_PAYMENT 23860 RATE_INTEREST_PRIMARY 46833 RATE_INTEREST_PRIVILEGED 46833 NAME_CASH_LOAN_PURPOSE 0 NAME_CONTRACT_STATUS 1 DAYS_DECISION 1 NAME_PAYMENT_TYPE 1 CODE_REJECT_REASON 1 NAME_TYPE_SUITE 22819 NAME_CLIENT_TYPE 1 NAME_GOODS_CATEGORY 1 NAME_PORTFOLIO 1 NAME_PRODUCT_TYPE 1 CHANNEL_TYPE 1 SELLERPLACE_AREA 1 NAME_SELLER_INDUSTRY 1 CNT_PAYMENT 10002 NAME_YIELD_GROUP 1 PRODUCT_COMBINATION 9 DAYS_FIRST_DRAWING 18105 DAYS_FIRST_DUE 18105 DAYS_LAST_DUE_1ST_VERSION 18105 DAYS_LAST_DUE 18105 DAYS_TERMINATION 18105 NFLAG_INSURED_ON_APPROVAL 18105 dtype: int64
features = ['AMT_ANNUITY', 'AMT_APPLICATION']
agg_op_features = {}
cols = []
agg_func_list = ["mean", "min", "max"]
for f in features: # build agg dictionary
    agg_op_features[f] = agg_func_list
    cols.extend(f"{f}_{func}" for func in agg_func_list)  # extend, not append: append would add a generator object
print(agg_op_features)
print(f"{appsDF[features].describe()}")
print()
# # results = appsDF.groupby('SK_ID_CURR').agg({'AMT_ANNUITY': ['mean', 'min', 'max'],'AMT_APPLICATION': ['mean', 'min', 'max'] })
# result = appsDF.groupby('SK_ID_CURR').agg({features[0]: ['mean', 'min', 'max'],features[1]: ['mean', 'min', 'max'] })
result = appsDF.groupby('SK_ID_CURR').agg(agg_op_features)
result.columns = ["_".join(x) for x in result.columns.ravel()]
result = result.reset_index(level=["SK_ID_CURR"])
result['range_AMT_APPLICATION'] = result['AMT_APPLICATION_max'] - result['AMT_APPLICATION_min']
print(f"result.shape: {result.shape}")
result[:10]
{'AMT_ANNUITY': ['mean', 'min', 'max'], 'AMT_APPLICATION': ['mean', 'min', 'max']}
AMT_ANNUITY AMT_APPLICATION
count 1.041823e+06 1.339313e+06
mean 1.590266e+04 1.745077e+05
std 1.474784e+04 2.917110e+05
min 0.000000e+00 0.000000e+00
25% 6.301035e+03 1.889550e+04
50% 1.125000e+04 7.087500e+04
75% 2.055431e+04 1.800000e+05
max 4.180581e+05 6.905160e+06
result.shape: (324625, 8)
| SK_ID_CURR | AMT_ANNUITY_mean | AMT_ANNUITY_min | AMT_ANNUITY_max | AMT_APPLICATION_mean | AMT_APPLICATION_min | AMT_APPLICATION_max | range_AMT_APPLICATION | |
|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | 3951.00000 | 3951.000 | 3951.000 | 24835.500000 | 24835.5 | 24835.5 | 0.0 |
| 1 | 100002 | 9251.77500 | 9251.775 | 9251.775 | 179055.000000 | 179055.0 | 179055.0 | 0.0 |
| 2 | 100003 | 56553.99000 | 6737.310 | 98356.995 | 435436.500000 | 68809.5 | 900000.0 | 831190.5 |
| 3 | 100004 | 5357.25000 | 5357.250 | 5357.250 | 24282.000000 | 24282.0 | 24282.0 | 0.0 |
| 4 | 100005 | NaN | NaN | NaN | 0.000000 | 0.0 | 0.0 | 0.0 |
| 5 | 100006 | 21842.19000 | 2482.920 | 39954.510 | 251618.477143 | 0.0 | 675000.0 | 675000.0 |
| 6 | 100007 | 10198.80900 | 1834.290 | 16509.600 | 140136.300000 | 17176.5 | 247500.0 | 230323.5 |
| 7 | 100008 | 15839.69625 | 8019.090 | 25309.575 | 155701.800000 | 0.0 | 450000.0 | 450000.0 |
| 8 | 100009 | 8634.65400 | 7435.845 | 10418.670 | 65714.400000 | 40455.0 | 98239.5 | 57784.5 |
| 9 | 100011 | 18303.19500 | 9000.000 | 31295.250 | 202732.875000 | 0.0 | 675000.0 | 675000.0 |
result.isna().sum()
SK_ID_CURR 0 AMT_ANNUITY_mean 7734 AMT_ANNUITY_min 7734 AMT_ANNUITY_max 7734 AMT_APPLICATION_mean 0 AMT_APPLICATION_min 0 AMT_APPLICATION_max 0 range_AMT_APPLICATION 0 dtype: int64
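The 7,734 NaN rows are clients (such as 100005 above) whose previous applications all lack AMT_ANNUITY. One hedged option — a modeling choice, not something the data dictates — is to impute after the merge, e.g. with the column median:

```python
import numpy as np
import pandas as pd

# Toy aggregate table with one client whose annuity stats are all missing.
result = pd.DataFrame({
    "SK_ID_CURR": [100001, 100005],
    "AMT_ANNUITY_mean": [3951.0, np.nan],
})

# Median imputation keeps the column numeric for downstream scalers and
# estimators; zero-fill or an explicit missing-indicator column are
# reasonable alternatives.
result["AMT_ANNUITY_mean"] = result["AMT_ANNUITY_mean"].fillna(
    result["AMT_ANNUITY_mean"].median())
print(result)
```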
# Create aggregate features (via pipeline)
class prevAppsFeaturesAggregater(BaseEstimator, TransformerMixin):
def __init__(self, features=None): # no *args or **kargs
self.features = features
self.agg_op_features = {}
for f in features:
# self.agg_op_features[f] = {f"{f}_{func}":func for func in ["min", "max", "mean"]}
self.agg_op_features[f] = ["min", "max", "mean"]
def fit(self, X, y=None):
return self
def transform(self, X, y=None):
#from IPython.core.debugger import Pdb as pdb; pdb().set_trace() #breakpoint; dont forget to quit
result = X.groupby(["SK_ID_CURR"]).agg(self.agg_op_features)
# result.columns = result.columns.droplevel()
result.columns = ["_".join(x) for x in result.columns.ravel()]
result = result.reset_index(level=["SK_ID_CURR"])
result['range_AMT_APPLICATION'] = result['AMT_APPLICATION_max'] - result['AMT_APPLICATION_min']
return result # return dataframe with the join key "SK_ID_CURR"
from sklearn.pipeline import make_pipeline
def test_driver_prevAppsFeaturesAggregater(df, features):
print(f"df.shape: {df.shape}\n")
print(f"df[{features}][0:5]: \n{df[features][0:5]}")
test_pipeline = make_pipeline(prevAppsFeaturesAggregater(features))
return(test_pipeline.fit_transform(df))
features = ['AMT_ANNUITY', 'AMT_APPLICATION']
# A wider candidate list, kept for reference (min/max/mean aggregation only
# makes sense for the numeric columns, so NAME_PAYMENT_TYPE would need to be
# dropped before using it):
# features = ['AMT_ANNUITY',
#             'AMT_APPLICATION', 'AMT_CREDIT', 'AMT_DOWN_PAYMENT', 'AMT_GOODS_PRICE',
#             'RATE_DOWN_PAYMENT', 'RATE_INTEREST_PRIMARY',
#             'RATE_INTEREST_PRIVILEGED', 'DAYS_DECISION', 'NAME_PAYMENT_TYPE',
#             'CNT_PAYMENT',
#             'DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE', 'DAYS_LAST_DUE_1ST_VERSION',
#             'DAYS_LAST_DUE', 'DAYS_TERMINATION']
res = test_driver_prevAppsFeaturesAggregater(appsDF, features)
print(f"HELLO")
print(f"Test driver: \n{res[0:10]}")
print(f"input[features][0:10]: \n{appsDF[0:10]}")
# QUESTION: should we lowercase df['OCCUPATION_TYPE'], since 'Sales staff' != 'Sales Staff'? (hint: YES)
df.shape: (1339313, 37)
df[['AMT_ANNUITY', 'AMT_APPLICATION']][0:5]:
AMT_ANNUITY AMT_APPLICATION
0 1730.430 17145.0
1 25188.615 607500.0
2 15060.735 112500.0
3 47041.335 450000.0
4 31924.395 337500.0
HELLO
Test driver:
SK_ID_CURR AMT_ANNUITY_min AMT_ANNUITY_max AMT_ANNUITY_mean \
0 100001 3951.000 3951.000 3951.00000
1 100002 9251.775 9251.775 9251.77500
2 100003 6737.310 98356.995 56553.99000
3 100004 5357.250 5357.250 5357.25000
4 100005 NaN NaN NaN
5 100006 2482.920 39954.510 21842.19000
6 100007 1834.290 16509.600 10198.80900
7 100008 8019.090 25309.575 15839.69625
8 100009 7435.845 10418.670 8634.65400
9 100011 9000.000 31295.250 18303.19500
AMT_APPLICATION_min AMT_APPLICATION_max AMT_APPLICATION_mean \
0 24835.5 24835.5 24835.500000
1 179055.0 179055.0 179055.000000
2 68809.5 900000.0 435436.500000
3 24282.0 24282.0 24282.000000
4 0.0 0.0 0.000000
5 0.0 675000.0 251618.477143
6 17176.5 247500.0 140136.300000
7 0.0 450000.0 155701.800000
8 40455.0 98239.5 65714.400000
9 0.0 675000.0 202732.875000
range_AMT_APPLICATION
0 0.0
1 0.0
2 831190.5
3 0.0
4 0.0
5 675000.0
6 230323.5
7 450000.0
8 57784.5
9 675000.0
input[features][0:10]:
SK_ID_PREV SK_ID_CURR NAME_CONTRACT_TYPE AMT_ANNUITY AMT_APPLICATION \
0 2030495 271877 Consumer loans 1730.430 17145.0
1 2802425 108129 Cash loans 25188.615 607500.0
2 2523466 122040 Cash loans 15060.735 112500.0
3 2819243 176158 Cash loans 47041.335 450000.0
4 1784265 202054 Cash loans 31924.395 337500.0
5 1383531 199383 Cash loans 23703.930 315000.0
6 2315218 175704 Cash loans NaN 0.0
7 1656711 296299 Cash loans NaN 0.0
8 2367563 342292 Cash loans NaN 0.0
9 2579447 334349 Cash loans NaN 0.0
AMT_CREDIT AMT_DOWN_PAYMENT AMT_GOODS_PRICE WEEKDAY_APPR_PROCESS_START \
0 17145.0 0.0 17145.0 SATURDAY
1 679671.0 NaN 607500.0 THURSDAY
2 136444.5 NaN 112500.0 TUESDAY
3 470790.0 NaN 450000.0 MONDAY
4 404055.0 NaN 337500.0 THURSDAY
5 340573.5 NaN 315000.0 SATURDAY
6 0.0 NaN NaN TUESDAY
7 0.0 NaN NaN MONDAY
8 0.0 NaN NaN MONDAY
9 0.0 NaN NaN SATURDAY
HOUR_APPR_PROCESS_START FLAG_LAST_APPL_PER_CONTRACT \
0 15 Y
1 11 Y
2 11 Y
3 7 Y
4 9 Y
5 8 Y
6 11 Y
7 7 Y
8 15 Y
9 15 Y
NFLAG_LAST_APPL_IN_DAY RATE_DOWN_PAYMENT RATE_INTEREST_PRIMARY \
0 1 0.0 0.182832
1 1 NaN NaN
2 1 NaN NaN
3 1 NaN NaN
4 1 NaN NaN
5 1 NaN NaN
6 1 NaN NaN
7 1 NaN NaN
8 1 NaN NaN
9 1 NaN NaN
RATE_INTEREST_PRIVILEGED NAME_CASH_LOAN_PURPOSE NAME_CONTRACT_STATUS \
0 0.867336 XAP Approved
1 NaN XNA Approved
2 NaN XNA Approved
3 NaN XNA Approved
4 NaN Repairs Refused
5 NaN Everyday expenses Approved
6 NaN XNA Canceled
7 NaN XNA Canceled
8 NaN XNA Canceled
9 NaN XNA Canceled
DAYS_DECISION NAME_PAYMENT_TYPE CODE_REJECT_REASON NAME_TYPE_SUITE \
0 -73 Cash through the bank XAP NaN
1 -164 XNA XAP Unaccompanied
2 -301 Cash through the bank XAP Spouse, partner
3 -512 Cash through the bank XAP NaN
4 -781 Cash through the bank HC NaN
5 -684 Cash through the bank XAP Family
6 -14 XNA XAP NaN
7 -21 XNA XAP NaN
8 -386 XNA XAP NaN
9 -57 XNA XAP NaN
NAME_CLIENT_TYPE NAME_GOODS_CATEGORY NAME_PORTFOLIO NAME_PRODUCT_TYPE \
0 Repeater Mobile POS XNA
1 Repeater XNA Cash x-sell
2 Repeater XNA Cash x-sell
3 Repeater XNA Cash x-sell
4 Repeater XNA Cash walk-in
5 Repeater XNA Cash x-sell
6 Repeater XNA XNA XNA
7 Repeater XNA XNA XNA
8 Repeater XNA XNA XNA
9 Repeater XNA XNA XNA
CHANNEL_TYPE SELLERPLACE_AREA NAME_SELLER_INDUSTRY \
0 Country-wide 35.0 Connectivity
1 Contact center -1.0 XNA
2 Credit and cash offices -1.0 XNA
3 Credit and cash offices -1.0 XNA
4 Credit and cash offices -1.0 XNA
5 Credit and cash offices -1.0 XNA
6 Credit and cash offices -1.0 XNA
7 Credit and cash offices -1.0 XNA
8 Credit and cash offices -1.0 XNA
9 Credit and cash offices -1.0 XNA
CNT_PAYMENT NAME_YIELD_GROUP PRODUCT_COMBINATION DAYS_FIRST_DRAWING \
0 12.0 middle POS mobile with interest 365243.0
1 36.0 low_action Cash X-Sell: low 365243.0
2 12.0 high Cash X-Sell: high 365243.0
3 12.0 middle Cash X-Sell: middle 365243.0
4 24.0 high Cash Street: high NaN
5 18.0 low_normal Cash X-Sell: low 365243.0
6 NaN XNA Cash NaN
7 NaN XNA Cash NaN
8 NaN XNA Cash NaN
9 NaN XNA Cash NaN
DAYS_FIRST_DUE DAYS_LAST_DUE_1ST_VERSION DAYS_LAST_DUE DAYS_TERMINATION \
0 -42.0 300.0 -42.0 -37.0
1 -134.0 916.0 365243.0 365243.0
2 -271.0 59.0 365243.0 365243.0
3 -482.0 -152.0 -182.0 -177.0
4 NaN NaN NaN NaN
5 -654.0 -144.0 -144.0 -137.0
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 NaN NaN NaN NaN
NFLAG_INSURED_ON_APPROVAL
0 0.0
1 1.0
2 1.0
3 1.0
4 NaN
5 1.0
6 NaN
7 NaN
8 NaN
9 NaN
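On the earlier question of whether to lowercase OCCUPATION_TYPE (since 'Sales staff' != 'Sales Staff'): a minimal normalization sketch on invented values, which would typically run before one-hot encoding:

```python
import pandas as pd

occ = pd.Series(["Sales staff", "Sales Staff", "Core staff", None])

# Lower-case and strip whitespace so case variants collapse into a single
# category; missing entries pass through the .str accessor as NaN.
normalized = occ.str.lower().str.strip()
print(normalized.nunique())
```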
datasets.keys()
dict_keys(['application_test', 'application_train', 'bureau', 'bureau_balance', 'credit_card_balance', 'installments_payments', 'previous_application', 'POS_CASH_balance'])
agg_funcs = ['min', 'max', 'mean', 'count', 'sum']
prevApps = datasets['previous_application']
prevApps_features = ['AMT_ANNUITY', 'AMT_APPLICATION']
bureau = datasets['bureau']
bureau_features = ['AMT_ANNUITY', 'AMT_CREDIT_SUM']
bureau_bal = datasets['bureau_balance']
bureau_bal_features = ['MONTHS_BALANCE']
cc_bal = datasets['credit_card_balance']
cc_bal_features = ['MONTHS_BALANCE', 'AMT_BALANCE', 'CNT_INSTALMENT_MATURE_CUM']
installments_pmnts = datasets['installments_payments']
installments_pmnts_features = ['AMT_INSTALMENT', 'AMT_PAYMENT']
# Pipelines
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline, Pipeline, FeatureUnion
from sklearn.preprocessing import MinMaxScaler, StandardScaler, OneHotEncoder
class FeaturesAggregator(BaseEstimator, TransformerMixin):
def __init__(self, file_name=None, features=None, funcs=None): # no *args or **kargs
self.file_name = file_name
self.features = features
self.funcs = funcs
self.agg_op_features = {}
for f in self.features:
temp = {f"{file_name}_{f}_{func}":func for func in self.funcs}
self.agg_op_features[f]=[(k, v) for k, v in temp.items()]
print(self.agg_op_features)
def fit(self, X, y=None):
return self
def transform(self, X, y=None):
#from IPython.core.debugger import Pdb as pdb; pdb().set_trace() #breakpoint; dont forget to quit
result = X.groupby(["SK_ID_CURR"]).agg(self.agg_op_features)
result.columns = result.columns.droplevel()
result = result.reset_index(level=["SK_ID_CURR"])
return result # return dataframe with the join key "SK_ID_CURR"
class engineer_features(BaseEstimator, TransformerMixin):
def __init__(self, features=None):
self.features = features
def fit(self, X, y=None):
return self
def transform(self, X, y=None):
X['ef_INCOME_CREDIT_PERCENT'] = (
X.AMT_INCOME_TOTAL / X.AMT_CREDIT).replace(np.inf, 0)
# ADD INCOME PER FAMILY MEMBER
X['ef_FAM_MEMBER_INCOME'] = (
X.AMT_INCOME_TOTAL / X.CNT_FAM_MEMBERS).replace(np.inf, 0)
# ADD ANNUITY AS PERCENTAGE OF ANNUAL INCOME
X['ef_ANN_INCOME_PERCENT'] = (
X.AMT_ANNUITY / X.AMT_INCOME_TOTAL).replace(np.inf, 0)
return X
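As a quick sanity check of the ratio logic in engineer_features, the first feature can be exercised on a toy frame (values invented); the `.replace(np.inf, 0)` guard handles zero denominators:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "AMT_INCOME_TOTAL": [100_000.0, 200_000.0],
    "AMT_CREDIT": [400_000.0, 0.0],  # zero credit would yield inf
})

# Same expression as in engineer_features: income as a fraction of credit,
# with inf (division by zero) mapped back to 0.
toy["ef_INCOME_CREDIT_PERCENT"] = (
    toy.AMT_INCOME_TOTAL / toy.AMT_CREDIT).replace(np.inf, 0)
print(toy["ef_INCOME_CREDIT_PERCENT"].tolist())
```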
from sklearn.pipeline import make_pipeline, Pipeline, FeatureUnion
prevApps_feature_pipeline = Pipeline([
('prevApps_aggregator', FeaturesAggregator('prevApps', prevApps_features, agg_funcs)), # Aggregate across old and new features
])
bureau_feature_pipeline = Pipeline([
('bureau_aggregator', FeaturesAggregator('bureau', bureau_features, agg_funcs)), # Aggregate across old and new features
])
bureau_bal_features_pipeline = Pipeline([
('bureau_bal_aggregator', FeaturesAggregator('bureau_balance', bureau_bal_features , agg_funcs)), # Aggregate across old and new features
])
cc_bal_features_pipeline = Pipeline([
('cc_bal_aggregator', FeaturesAggregator('credit_card_balance', cc_bal_features , agg_funcs)), # Aggregate across old and new features
])
installments_pmnts_features_pipeline = Pipeline([
('installments_pmnts_features_aggregator', FeaturesAggregator('installments_payments', installments_pmnts_features , agg_funcs)), # Aggregate across old and new features
])
appln_feature_pipeline = Pipeline([
('engineer_features', engineer_features())
])
{'AMT_ANNUITY': [('prevApps_AMT_ANNUITY_min', 'min'), ('prevApps_AMT_ANNUITY_max', 'max'), ('prevApps_AMT_ANNUITY_mean', 'mean'), ('prevApps_AMT_ANNUITY_count', 'count'), ('prevApps_AMT_ANNUITY_sum', 'sum')], 'AMT_APPLICATION': [('prevApps_AMT_APPLICATION_min', 'min'), ('prevApps_AMT_APPLICATION_max', 'max'), ('prevApps_AMT_APPLICATION_mean', 'mean'), ('prevApps_AMT_APPLICATION_count', 'count'), ('prevApps_AMT_APPLICATION_sum', 'sum')]}
{'AMT_ANNUITY': [('bureau_AMT_ANNUITY_min', 'min'), ('bureau_AMT_ANNUITY_max', 'max'), ('bureau_AMT_ANNUITY_mean', 'mean'), ('bureau_AMT_ANNUITY_count', 'count'), ('bureau_AMT_ANNUITY_sum', 'sum')], 'AMT_CREDIT_SUM': [('bureau_AMT_CREDIT_SUM_min', 'min'), ('bureau_AMT_CREDIT_SUM_max', 'max'), ('bureau_AMT_CREDIT_SUM_mean', 'mean'), ('bureau_AMT_CREDIT_SUM_count', 'count'), ('bureau_AMT_CREDIT_SUM_sum', 'sum')]}
{'MONTHS_BALANCE': [('bureau_balance_MONTHS_BALANCE_min', 'min'), ('bureau_balance_MONTHS_BALANCE_max', 'max'), ('bureau_balance_MONTHS_BALANCE_mean', 'mean'), ('bureau_balance_MONTHS_BALANCE_count', 'count'), ('bureau_balance_MONTHS_BALANCE_sum', 'sum')]}
{'MONTHS_BALANCE': [('credit_card_balance_MONTHS_BALANCE_min', 'min'), ('credit_card_balance_MONTHS_BALANCE_max', 'max'), ('credit_card_balance_MONTHS_BALANCE_mean', 'mean'), ('credit_card_balance_MONTHS_BALANCE_count', 'count'), ('credit_card_balance_MONTHS_BALANCE_sum', 'sum')], 'AMT_BALANCE': [('credit_card_balance_AMT_BALANCE_min', 'min'), ('credit_card_balance_AMT_BALANCE_max', 'max'), ('credit_card_balance_AMT_BALANCE_mean', 'mean'), ('credit_card_balance_AMT_BALANCE_count', 'count'), ('credit_card_balance_AMT_BALANCE_sum', 'sum')], 'CNT_INSTALMENT_MATURE_CUM': [('credit_card_balance_CNT_INSTALMENT_MATURE_CUM_min', 'min'), ('credit_card_balance_CNT_INSTALMENT_MATURE_CUM_max', 'max'), ('credit_card_balance_CNT_INSTALMENT_MATURE_CUM_mean', 'mean'), ('credit_card_balance_CNT_INSTALMENT_MATURE_CUM_count', 'count'), ('credit_card_balance_CNT_INSTALMENT_MATURE_CUM_sum', 'sum')]}
{'AMT_INSTALMENT': [('installments_payments_AMT_INSTALMENT_min', 'min'), ('installments_payments_AMT_INSTALMENT_max', 'max'), ('installments_payments_AMT_INSTALMENT_mean', 'mean'), ('installments_payments_AMT_INSTALMENT_count', 'count'), ('installments_payments_AMT_INSTALMENT_sum', 'sum')], 'AMT_PAYMENT': [('installments_payments_AMT_PAYMENT_min', 'min'), ('installments_payments_AMT_PAYMENT_max', 'max'), ('installments_payments_AMT_PAYMENT_mean', 'mean'), ('installments_payments_AMT_PAYMENT_count', 'count'), ('installments_payments_AMT_PAYMENT_sum', 'sum')]}
appsTrainDF = datasets['application_train']
prevAppsDF = datasets["previous_application"] #prev app
bureauDF = datasets["bureau"] #bureau app
bureaubalDF = datasets['bureau_balance']
ccbalDF = datasets["credit_card_balance"] #prev app
installmentspaymentsDF = datasets["installments_payments"] #bureau app
features = ['AMT_ANNUITY', 'AMT_APPLICATION']
prevApps_feature_pipeline = Pipeline([
('prevApps_aggregator', FeaturesAggregator('prevApps', prevApps_features, agg_funcs)), # Aggregate across old and new features
])
X_train= datasets["application_train"] #primary dataset
appsDF = datasets["previous_application"] #prev app
merge_all_data = True
if merge_all_data:
prevApps_aggregated = prevApps_feature_pipeline.transform(appsDF)
# merge primary table and secondary tables using features based on metadata and aggregate stats
if merge_all_data:
# 1. Join/Merge in prevApps Data
X_train = X_train.merge(prevApps_aggregated, how='left', on='SK_ID_CURR')
# ......
{'AMT_ANNUITY': [('prevApps_AMT_ANNUITY_min', 'min'), ('prevApps_AMT_ANNUITY_max', 'max'), ('prevApps_AMT_ANNUITY_mean', 'mean'), ('prevApps_AMT_ANNUITY_count', 'count'), ('prevApps_AMT_ANNUITY_sum', 'sum')], 'AMT_APPLICATION': [('prevApps_AMT_APPLICATION_min', 'min'), ('prevApps_AMT_APPLICATION_max', 'max'), ('prevApps_AMT_APPLICATION_mean', 'mean'), ('prevApps_AMT_APPLICATION_count', 'count'), ('prevApps_AMT_APPLICATION_sum', 'sum')]}
appsTrainDF = appln_feature_pipeline.fit_transform(appsTrainDF)
prevApps_aggregated = prevApps_feature_pipeline.fit_transform(prevAppsDF)
bureau_aggregated = bureau_feature_pipeline.fit_transform(bureauDF)
X_kaggle_test= datasets["application_test"]
X_kaggle_test = appln_feature_pipeline.fit_transform(X_kaggle_test)
merge_all_data = True
if merge_all_data:
# 1. Join/Merge in prevApps Data
X_kaggle_test = X_kaggle_test.merge(prevApps_aggregated, how='left', on='SK_ID_CURR')
# 2. Join/Merge in ...... Data
#X_train = X_train.merge(...._aggregated, how='left', on="SK_ID_CURR")
# 3. Join/Merge in .....Data
#df_labeled = df_labeled.merge(...._aggregated, how='left', on="SK_ID_CURR")
# 4. Join/Merge in Aggregated ...... Data
#df_labeled = df_labeled.merge(...._aggregated, how='left', on="SK_ID_CURR")
# ......
# Function used to create a correlation matrix with Application train
def correlation(df):
app_train = datasets["application_train"].copy()
df = datasets[df].copy()
correlation_matrix = pd.concat([app_train.TARGET, df], axis=1).corr().filter(df.columns).filter(app_train.columns, axis=0)
return correlation_matrix
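One caveat about this function: `pd.concat` aligns on the row index, so TARGET at index i is paired with row i of the secondary table, not necessarily the same client. A hedged per-client alternative (toy data, not the notebook's computation) aggregates on SK_ID_CURR before correlating:

```python
import pandas as pd

# Toy tables: TARGET lives on the primary table; the feature lives on a
# many-rows-per-client secondary table (all values invented).
app = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]})
sec = pd.DataFrame({"SK_ID_CURR": [1, 1, 2, 3],
                    "AMT_ANNUITY": [10.0, 30.0, 50.0, 20.0]})

# Aggregate to one row per client, join on the key, then correlate, so each
# TARGET is paired with its own client's statistic.
per_client = (sec.groupby("SK_ID_CURR")["AMT_ANNUITY"]
              .mean().rename("AMT_ANNUITY_mean").reset_index())
joined = app.merge(per_client, how="left", on="SK_ID_CURR")
print(joined["TARGET"].corr(joined["AMT_ANNUITY_mean"]))
```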
# Correlation matrix of Application train and previous_application
ds_name = 'previous_application'
correlation_matrix = correlation(ds_name)
print(f"Correlation of the {ds_name} against the Target is :")
correlation_matrix.style.background_gradient(cmap='coolwarm').set_precision(3)
Correlation of the previous_application against the Target is :
| SK_ID_PREV | SK_ID_CURR | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | HOUR_APPR_PROCESS_START | NFLAG_LAST_APPL_IN_DAY | RATE_DOWN_PAYMENT | RATE_INTEREST_PRIMARY | RATE_INTEREST_PRIVILEGED | DAYS_DECISION | SELLERPLACE_AREA | CNT_PAYMENT | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SK_ID_CURR | 0.002 | 1.000 | 0.002 | 0.003 | 0.001 | -0.010 | 0.004 | 0.003 | 0.004 | -0.005 | 0.070 | -0.164 | -0.005 | 0.000 | -0.006 | 0.001 | 0.005 | 0.001 | -0.009 | -0.009 | -0.004 |
| TARGET | 0.004 | -0.003 | -0.002 | -0.002 | -0.003 | 0.006 | -0.001 | -0.005 | 0.001 | 0.002 | 0.212 | -0.090 | 0.002 | -0.002 | -0.001 | -0.006 | -0.009 | -0.000 | -0.012 | -0.006 | -0.007 |
| AMT_CREDIT | 0.000 | 0.001 | 0.820 | 0.976 | 1.000 | 0.303 | 0.993 | -0.032 | -0.023 | -0.192 | 0.081 | -0.227 | 0.136 | -0.005 | 0.668 | -0.043 | -0.009 | 0.042 | 0.231 | 0.223 | 0.271 |
| AMT_ANNUITY | 0.002 | 0.002 | 1.000 | 0.813 | 0.820 | 0.267 | 0.826 | -0.039 | 0.020 | -0.105 | 0.086 | -0.238 | 0.270 | -0.010 | 0.389 | 0.044 | -0.066 | -0.069 | 0.086 | 0.075 | 0.285 |
| AMT_GOODS_PRICE | 0.009 | 0.004 | 0.826 | 1.000 | 0.993 | 0.463 | 1.000 | -0.056 | -0.015 | -0.080 | 0.058 | -0.247 | 0.279 | -0.008 | 0.665 | -0.028 | -0.030 | 0.013 | 0.219 | 0.217 | 0.251 |
| HOUR_APPR_PROCESS_START | -0.001 | 0.003 | -0.039 | -0.023 | -0.032 | 0.018 | -0.056 | 1.000 | 0.011 | 0.015 | -0.130 | -0.005 | -0.036 | 0.016 | -0.066 | 0.020 | -0.008 | -0.025 | -0.019 | -0.022 | -0.120 |
correlation_matrix.T.TARGET.sort_values(ascending= False)
RATE_INTEREST_PRIMARY 0.212147 AMT_DOWN_PAYMENT 0.006014 SK_ID_PREV 0.003748 RATE_DOWN_PAYMENT 0.001717 DAYS_DECISION 0.001519 NFLAG_LAST_APPL_IN_DAY 0.001207 DAYS_LAST_DUE_1ST_VERSION -0.000284 AMT_GOODS_PRICE -0.000700 CNT_PAYMENT -0.000907 SELLERPLACE_AREA -0.001558 AMT_ANNUITY -0.002035 AMT_APPLICATION -0.002062 SK_ID_CURR -0.002737 AMT_CREDIT -0.002894 HOUR_APPR_PROCESS_START -0.005145 DAYS_TERMINATION -0.006304 DAYS_FIRST_DRAWING -0.006484 NFLAG_INSURED_ON_APPROVAL -0.006577 DAYS_FIRST_DUE -0.009374 DAYS_LAST_DUE -0.012453 RATE_INTEREST_PRIVILEGED -0.089622 Name: TARGET, dtype: float64
# Correlation matrix of Application train and Bureau
ds_name = 'bureau'
correlation_matrix = correlation(ds_name)
print(f"Correlation of the {ds_name} against the Target is :")
correlation_matrix.style.background_gradient(cmap='coolwarm').set_precision(3)
Correlation of the bureau against the Target is :
| SK_ID_CURR | SK_ID_BUREAU | DAYS_CREDIT | CREDIT_DAY_OVERDUE | DAYS_CREDIT_ENDDATE | DAYS_ENDDATE_FACT | AMT_CREDIT_MAX_OVERDUE | CNT_CREDIT_PROLONG | AMT_CREDIT_SUM | AMT_CREDIT_SUM_DEBT | AMT_CREDIT_SUM_LIMIT | AMT_CREDIT_SUM_OVERDUE | DAYS_CREDIT_UPDATE | AMT_ANNUITY | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SK_ID_CURR | 1.000 | 0.002 | 0.000 | -0.000 | 0.000 | -0.001 | 0.001 | -0.001 | 0.001 | -0.001 | -0.000 | -0.000 | 0.001 | -0.003 |
| TARGET | -0.001 | 0.002 | 0.001 | -0.002 | 0.002 | 0.000 | -0.000 | -0.000 | 0.000 | -0.001 | -0.001 | -0.001 | 0.002 | 0.000 |
| AMT_ANNUITY | -0.003 | 0.003 | 0.005 | -0.000 | 0.000 | 0.005 | 0.000 | -0.000 | 0.049 | 0.023 | 0.004 | 0.000 | 0.008 | 1.000 |
correlation_matrix.T.TARGET.sort_values(ascending= False)
DAYS_CREDIT_UPDATE 0.002159 DAYS_CREDIT_ENDDATE 0.002048 SK_ID_BUREAU 0.001550 DAYS_CREDIT 0.001443 AMT_CREDIT_SUM 0.000218 DAYS_ENDDATE_FACT 0.000203 AMT_ANNUITY 0.000189 AMT_CREDIT_MAX_OVERDUE -0.000389 CNT_CREDIT_PROLONG -0.000495 AMT_CREDIT_SUM_LIMIT -0.000558 AMT_CREDIT_SUM_DEBT -0.000946 SK_ID_CURR -0.001070 AMT_CREDIT_SUM_OVERDUE -0.001464 CREDIT_DAY_OVERDUE -0.001815 Name: TARGET, dtype: float64
# Correlation matrix of Application train and Bureau Balance
ds_name = 'bureau_balance'
correlation_matrix = correlation(ds_name)
print(f"Correlation of the {ds_name} against the Target is :")
correlation_matrix.style.background_gradient(cmap='coolwarm').set_precision(3)
Correlation of the bureau_balance against the Target is :
| SK_ID_BUREAU | MONTHS_BALANCE | |
|---|---|---|
| TARGET | 0.001 | -0.005 |
correlation_matrix.T.TARGET.sort_values(ascending= False)
SK_ID_BUREAU 0.001223 MONTHS_BALANCE -0.005262 Name: TARGET, dtype: float64
# Correlation matrix of Application train and POS CASH Balance
ds_name = 'POS_CASH_balance'
correlation_matrix = correlation(ds_name)
print(f"Correlation of the {ds_name} against the Target is :")
correlation_matrix.style.background_gradient(cmap='coolwarm').set_precision(3)
Correlation of the POS_CASH_balance against the Target is :
| SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | CNT_INSTALMENT | CNT_INSTALMENT_FUTURE | SK_DPD | SK_DPD_DEF | |
|---|---|---|---|---|---|---|---|
| SK_ID_CURR | -0.000 | 1.000 | 0.000 | 0.000 | -0.001 | 0.003 | 0.002 |
| TARGET | 0.002 | -0.000 | 0.003 | 0.001 | 0.003 | 0.000 | -0.001 |
correlation_matrix.T.TARGET.sort_values(ascending= False)
CNT_INSTALMENT_FUTURE 0.002811 MONTHS_BALANCE 0.002775 SK_ID_PREV 0.002164 CNT_INSTALMENT 0.001434 SK_DPD 0.000050 SK_ID_CURR -0.000136 SK_DPD_DEF -0.001362 Name: TARGET, dtype: float64
# Correlation matrix of Application train and Installments Payments
ds_name = 'installments_payments'
correlation_matrix = correlation(ds_name)
print(f"Correlation of the {ds_name} against the Target is :")
correlation_matrix.style.background_gradient(cmap='coolwarm').set_precision(3)
Correlation of the installments_payments against the Target is :
| SK_ID_PREV | SK_ID_CURR | NUM_INSTALMENT_VERSION | NUM_INSTALMENT_NUMBER | DAYS_INSTALMENT | DAYS_ENTRY_PAYMENT | AMT_INSTALMENT | AMT_PAYMENT | |
|---|---|---|---|---|---|---|---|---|
| SK_ID_CURR | 0.004 | 1.000 | -0.001 | 0.002 | 0.003 | 0.003 | -0.001 | -0.001 |
| TARGET | 0.003 | -0.001 | 0.003 | 0.001 | -0.004 | -0.004 | -0.004 | -0.004 |
correlation_matrix.T.TARGET.sort_values(ascending= False)
SK_ID_PREV 0.002891 NUM_INSTALMENT_VERSION 0.002511 NUM_INSTALMENT_NUMBER 0.000626 SK_ID_CURR -0.000781 AMT_PAYMENT -0.003512 DAYS_INSTALMENT -0.003955 AMT_INSTALMENT -0.003972 DAYS_ENTRY_PAYMENT -0.004046 Name: TARGET, dtype: float64
# Correlation matrix of Application train and Credit Card Balance
ds_name = 'credit_card_balance'
correlation_matrix = correlation(ds_name)
print(f"Correlation of the {ds_name} against the Target is :")
correlation_matrix.style.background_gradient(cmap='coolwarm').set_precision(3)
Correlation of the credit_card_balance against the Target is :
| SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | AMT_BALANCE | AMT_CREDIT_LIMIT_ACTUAL | AMT_DRAWINGS_ATM_CURRENT | AMT_DRAWINGS_CURRENT | AMT_DRAWINGS_OTHER_CURRENT | AMT_DRAWINGS_POS_CURRENT | AMT_INST_MIN_REGULARITY | AMT_PAYMENT_CURRENT | AMT_PAYMENT_TOTAL_CURRENT | AMT_RECEIVABLE_PRINCIPAL | AMT_RECIVABLE | AMT_TOTAL_RECEIVABLE | CNT_DRAWINGS_ATM_CURRENT | CNT_DRAWINGS_CURRENT | CNT_DRAWINGS_OTHER_CURRENT | CNT_DRAWINGS_POS_CURRENT | CNT_INSTALMENT_MATURE_CUM | SK_DPD | SK_DPD_DEF | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SK_ID_CURR | 0.004 | 1.000 | 0.003 | 0.006 | 0.006 | 0.003 | 0.003 | 0.004 | -0.001 | 0.006 | 0.002 | 0.002 | 0.006 | 0.006 | 0.006 | 0.005 | 0.003 | 0.002 | 0.001 | -0.002 | -0.001 | 0.002 |
| TARGET | 0.000 | 0.001 | -0.001 | 0.000 | 0.001 | 0.002 | -0.001 | -0.003 | -0.004 | 0.001 | -0.001 | -0.001 | 0.000 | 0.000 | 0.000 | 0.002 | -0.002 | -0.002 | -0.002 | -0.000 | 0.000 | -0.000 |
correlation_matrix.T.TARGET.sort_values(ascending= False)
CNT_DRAWINGS_ATM_CURRENT 0.001908 AMT_DRAWINGS_ATM_CURRENT 0.001520 AMT_INST_MIN_REGULARITY 0.001435 SK_ID_CURR 0.001086 AMT_CREDIT_LIMIT_ACTUAL 0.000515 AMT_BALANCE 0.000448 SK_ID_PREV 0.000446 AMT_RECIVABLE 0.000412 AMT_TOTAL_RECEIVABLE 0.000407 AMT_RECEIVABLE_PRINCIPAL 0.000383 SK_DPD 0.000092 SK_DPD_DEF -0.000201 CNT_INSTALMENT_MATURE_CUM -0.000342 MONTHS_BALANCE -0.000768 AMT_PAYMENT_CURRENT -0.001129 AMT_PAYMENT_TOTAL_CURRENT -0.001395 AMT_DRAWINGS_CURRENT -0.001419 CNT_DRAWINGS_CURRENT -0.001764 CNT_DRAWINGS_OTHER_CURRENT -0.001833 CNT_DRAWINGS_POS_CURRENT -0.002387 AMT_DRAWINGS_OTHER_CURRENT -0.002672 AMT_DRAWINGS_POS_CURRENT -0.003518 Name: TARGET, dtype: float64
class FeaturesAggregator(BaseEstimator, TransformerMixin):
def __init__(self, file_name=None, features=None, funcs=None, primary_id = None):
self.file_name = file_name
self.features = features
self.funcs = funcs
self.primary_id = primary_id
self.agg_op_features = {}
for f in self.features:
temp = {f"{file_name}_{f}_{func}":func for func in self.funcs}
self.agg_op_features[f]=[(k, v) for k, v in temp.items()]
print(self.agg_op_features)
def fit(self, X, y=None):
return self
def transform(self, X, y=None):
#from IPython.core.debugger import Pdb as pdb; pdb().set_trace() #breakpoint; dont forget to quit
result = X.groupby([self.primary_id]).agg(self.agg_op_features)
result.columns = result.columns.droplevel()
result = result.reset_index(level=[self.primary_id])
return result # return dataframe with the join key "SK_ID_CURR"
Different set of features can be used to create a new feature that might be helpful for classification. After data analysis, we have found the following three features those can be engineered -
Income Credit percentage - Total income / Credit amount
Average family member income - Total family income / count of family members
Annuity income percentage - Annuity / Total income
class engineer_features(BaseEstimator, TransformerMixin):
def __init__(self, features=None):
self
def fit(self, X, y=None):
return self
def transform(self, X, y=None):
# Total income / Credit amount
X['ef_INCOME_CREDIT_PERCENT'] = (
X.AMT_INCOME_TOTAL / X.AMT_CREDIT).replace(np.inf, 0)
# Total family income / count of family members
X['ef_FAM_MEMBER_INCOME'] = (
X.AMT_INCOME_TOTAL / X.CNT_FAM_MEMBERS).replace(np.inf, 0)
# Annuity / Total income
X['ef_ANN_INCOME_PERCENT'] = (
X.AMT_ANNUITY / X.AMT_INCOME_TOTAL).replace(np.inf, 0)
return X
class prep_OCCUPATION_TYPE(BaseEstimator, TransformerMixin):
def __init__(self, features="OCCUPATION_TYPE"): # no *args or **kargs
self.features = features
def fit(self, X, y=None):
return self # nothing else to do
def transform(self, X):
df = pd.DataFrame(X, columns=self.features)
#from IPython.core.debugger import Pdb as pdb; pdb().set_trace() #breakpoint; dont forget to quit
df['OCCUPATION_TYPE'] = df['OCCUPATION_TYPE'].apply(lambda x: 1. if x in ['Core Staff', 'Accountants', 'Managers', 'Sales Staff', 'Medicine Staff', 'High Skill Tech Staff', 'Realty Agents', 'IT Staff', 'HR Staff'] else 0.)
#df.drop(self.features, axis=1, inplace=True)
return np.array(df.values)
appsTrainDF = datasets['application_train']
appsTestDF = datasets['application_test']
prevAppsDF = datasets["previous_application"]
bureauDF = datasets["bureau"]
bureaubalDF = datasets['bureau_balance']
ccbalDF = datasets["credit_card_balance"] #prev app
installmentspaymentsDF = datasets["installments_payments"] #bureau app
pos_cash_bal_DF = datasets["POS_CASH_balance"] #POS_CASH_balance app
secondary_datasets = ['previous_application','bureau']
feature_set_2 = appsTrainDF.merge(bureauDF[['SK_ID_CURR','DAYS_CREDIT_UPDATE','DAYS_CREDIT_ENDDATE','DAYS_CREDIT','AMT_CREDIT_SUM']],how='left', on='SK_ID_CURR')
num_attributes = feature_set_2.select_dtypes(include=['int64', 'float64']).columns
print(num_attributes)
cat_attributes = feature_set_2.select_dtypes(exclude=['int64', 'float64']).columns
print(cat_attributes)
print('----'*15)
print('Total number of features in feature set 1 - ',(len(num_attributes) + len(cat_attributes)))
print('----'*15)
print('Number of numerical attributes - ',len(num_attributes))
print('Number of categorical attributes - ',len(cat_attributes))
Index(['SK_ID_CURR', 'TARGET', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL',
'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE',
'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH', 'DAYS_EMPLOYED',
...
'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT',
'AMT_REQ_CREDIT_BUREAU_YEAR', 'ef_INCOME_CREDIT_PERCENT',
'ef_FAM_MEMBER_INCOME', 'ef_ANN_INCOME_PERCENT', 'DAYS_CREDIT_UPDATE',
'DAYS_CREDIT_ENDDATE', 'DAYS_CREDIT', 'AMT_CREDIT_SUM'],
dtype='object', length=113)
Index(['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY',
'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE',
'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE',
'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE', 'FONDKAPREMONT_MODE',
'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE'],
dtype='object')
------------------------------------------------------------
Total number of features in feature set 1 - 129
------------------------------------------------------------
Number of numerical attributes - 113
Number of categorical attributes - 16
tertiaty_datasets=['bureau_balance','credit_card_balance','installments_payments','POS_CASH_balance']
payment_diff_curr_pay = total current payment - current payment
payment_diff_min_pay = total current payment - installment minimum regularity
ccbalDF['payment_diff_curr_pay'] = ccbalDF['AMT_PAYMENT_TOTAL_CURRENT'] - ccbalDF['AMT_PAYMENT_CURRENT']
ccbalDF['payment_diff_min_pay'] = ccbalDF['AMT_PAYMENT_TOTAL_CURRENT'] - ccbalDF['AMT_INST_MIN_REGULARITY']
def get_numattribs(ds_name):
num_attribs=(datasets[ds_name].select_dtypes(include=['int64', 'float64']).columns.tolist())
print()
print('Numerical attributes for',ds_name,' : ',num_attribs)
print()
return num_attribs
def get_catattribs(ds_name):
cat_attribs=(datasets[ds_name].select_dtypes(include=['object','string']).columns.tolist())
print()
print('Categorical attributes for',ds_name,' : ',cat_attribs)
print()
return cat_attribs
# Aggregate across old and new features
# agg_funcs = ['min', 'max', 'mean', 'count', 'sum']
agg_funcs = ['min', 'max']
primary_id1 = "SK_ID_PREV"
primary_id2 = "SK_ID_BUREAU"
posBal_features = ['MONTHS_BALANCE','CNT_INSTALMENT','CNT_INSTALMENT_FUTURE']
instalPay_features = ['DAYS_INSTALMENT','AMT_INSTALMENT']
ccBal_features = ['AMT_BALANCE','AMT_DRAWINGS_CURRENT','payment_diff_curr_pay','payment_diff_min_pay']
burBal_features = ['MONTHS_BALANCE']
prevApps_features = ['AMT_APPLICATION','AMT_CREDIT','AMT_ANNUITY'] # NO MISSING VALUES
bureau_features = ['AMT_CREDIT_SUM']
cc_features_pipeline = Pipeline([
('credit_card_num_aggregator', FeaturesAggregator('credit_card_balance',ccBal_features , agg_funcs, primary_id1)),
])
installment_features_pipeline = Pipeline([
('installment_num_aggregator', FeaturesAggregator('installments_payments',instalPay_features, agg_funcs, primary_id1)),
])
POS_CASH_balance_pipeline = Pipeline([
('POS_CASH_balance', FeaturesAggregator('POS_CASH_balance' ,posBal_features , agg_funcs, primary_id1)),
])
bureau_balance_feature_pipeline = Pipeline([
('bureau_balance', FeaturesAggregator('bureau_balance' ,burBal_features , agg_funcs, primary_id2)),
])
{'AMT_BALANCE': [('credit_card_balance_AMT_BALANCE_min', 'min'), ('credit_card_balance_AMT_BALANCE_max', 'max')], 'AMT_DRAWINGS_CURRENT': [('credit_card_balance_AMT_DRAWINGS_CURRENT_min', 'min'), ('credit_card_balance_AMT_DRAWINGS_CURRENT_max', 'max')], 'payment_diff_curr_pay': [('credit_card_balance_payment_diff_curr_pay_min', 'min'), ('credit_card_balance_payment_diff_curr_pay_max', 'max')], 'payment_diff_min_pay': [('credit_card_balance_payment_diff_min_pay_min', 'min'), ('credit_card_balance_payment_diff_min_pay_max', 'max')]}
{'DAYS_INSTALMENT': [('installments_payments_DAYS_INSTALMENT_min', 'min'), ('installments_payments_DAYS_INSTALMENT_max', 'max')], 'AMT_INSTALMENT': [('installments_payments_AMT_INSTALMENT_min', 'min'), ('installments_payments_AMT_INSTALMENT_max', 'max')]}
{'MONTHS_BALANCE': [('POS_CASH_balance_MONTHS_BALANCE_min', 'min'), ('POS_CASH_balance_MONTHS_BALANCE_max', 'max')], 'CNT_INSTALMENT': [('POS_CASH_balance_CNT_INSTALMENT_min', 'min'), ('POS_CASH_balance_CNT_INSTALMENT_max', 'max')], 'CNT_INSTALMENT_FUTURE': [('POS_CASH_balance_CNT_INSTALMENT_FUTURE_min', 'min'), ('POS_CASH_balance_CNT_INSTALMENT_FUTURE_max', 'max')]}
{'MONTHS_BALANCE': [('bureau_balance_MONTHS_BALANCE_min', 'min'), ('bureau_balance_MONTHS_BALANCE_max', 'max')]}
bureaubal_aggregated = bureau_balance_feature_pipeline.fit_transform(bureaubalDF)
ccblance_aggregated = cc_features_pipeline.fit_transform(ccbalDF)
installments_pmnts_aggregated = installment_features_pipeline.fit_transform(installmentspaymentsDF)
pos_cash_bal_aggregated = POS_CASH_balance_pipeline.fit_transform(pos_cash_bal_DF)
Previous Apps¶prevApps_ThirdTierMerge = True
posBal_join_feature = 'SK_ID_PREV'
prevApps_join_feature = 'SK_ID_CURR'
bureau_join_feature = 'SK_ID_CURR'
instalPay_join_feature = 'SK_ID_PREV'
ccBal_join_feature = 'SK_ID_PREV'
burBal_join_feature = 'SK_ID_BUREAU'
if prevApps_ThirdTierMerge:
# Merge Datasets
prevAppsDF = prevAppsDF.merge(pos_cash_bal_aggregated, how='left', on=posBal_join_feature)
prevAppsDF = prevAppsDF.merge(installments_pmnts_aggregated, how='left', on=instalPay_join_feature)
prevAppsDF = prevAppsDF.merge(ccblance_aggregated, how='left', on=ccBal_join_feature)
prevApps_features.extend(installments_pmnts_aggregated.columns[1:])
prevApps_features.extend(ccblance_aggregated.columns[1:])
prevApps_features.extend(pos_cash_bal_aggregated.columns[1:])
prevApps_features
['AMT_APPLICATION', 'AMT_CREDIT', 'AMT_ANNUITY', 'installments_payments_DAYS_INSTALMENT_min', 'installments_payments_DAYS_INSTALMENT_max', 'installments_payments_AMT_INSTALMENT_min', 'installments_payments_AMT_INSTALMENT_max', 'credit_card_balance_AMT_BALANCE_min', 'credit_card_balance_AMT_BALANCE_max', 'credit_card_balance_AMT_DRAWINGS_CURRENT_min', 'credit_card_balance_AMT_DRAWINGS_CURRENT_max', 'credit_card_balance_payment_diff_curr_pay_min', 'credit_card_balance_payment_diff_curr_pay_max', 'credit_card_balance_payment_diff_min_pay_min', 'credit_card_balance_payment_diff_min_pay_max', 'POS_CASH_balance_MONTHS_BALANCE_min', 'POS_CASH_balance_MONTHS_BALANCE_max', 'POS_CASH_balance_CNT_INSTALMENT_min', 'POS_CASH_balance_CNT_INSTALMENT_max', 'POS_CASH_balance_CNT_INSTALMENT_FUTURE_min', 'POS_CASH_balance_CNT_INSTALMENT_FUTURE_max']
Bureau¶bureau_ThirdTierMerge = True
if bureau_ThirdTierMerge:
# Merge Dataset
bureauDF = bureauDF.merge(bureaubal_aggregated, how='left', on=burBal_join_feature)
# Add Created Features
bureau_features.extend(bureaubal_aggregated.columns[1:])
#agg_funcs = ['min', 'max', 'mean', 'count', 'sum']
agg_funcs = ['count', 'max', 'min', 'sum']
primary_id1 = "SK_ID_CURR"
prevApps_feature_pipeline = Pipeline([
('prevApps', FeaturesAggregator('prevApps' ,prevApps_features , agg_funcs, primary_id1)),
])
bureau_feature_pipeline = Pipeline([
('bureau', FeaturesAggregator('bureau' ,bureau_features , agg_funcs, primary_id1)),
])
{'AMT_APPLICATION': [('prevApps_AMT_APPLICATION_count', 'count'), ('prevApps_AMT_APPLICATION_max', 'max'), ('prevApps_AMT_APPLICATION_min', 'min'), ('prevApps_AMT_APPLICATION_sum', 'sum')], 'AMT_CREDIT': [('prevApps_AMT_CREDIT_count', 'count'), ('prevApps_AMT_CREDIT_max', 'max'), ('prevApps_AMT_CREDIT_min', 'min'), ('prevApps_AMT_CREDIT_sum', 'sum')], 'AMT_ANNUITY': [('prevApps_AMT_ANNUITY_count', 'count'), ('prevApps_AMT_ANNUITY_max', 'max'), ('prevApps_AMT_ANNUITY_min', 'min'), ('prevApps_AMT_ANNUITY_sum', 'sum')], 'installments_payments_DAYS_INSTALMENT_min': [('prevApps_installments_payments_DAYS_INSTALMENT_min_count', 'count'), ('prevApps_installments_payments_DAYS_INSTALMENT_min_max', 'max'), ('prevApps_installments_payments_DAYS_INSTALMENT_min_min', 'min'), ('prevApps_installments_payments_DAYS_INSTALMENT_min_sum', 'sum')], 'installments_payments_DAYS_INSTALMENT_max': [('prevApps_installments_payments_DAYS_INSTALMENT_max_count', 'count'), ('prevApps_installments_payments_DAYS_INSTALMENT_max_max', 'max'), ('prevApps_installments_payments_DAYS_INSTALMENT_max_min', 'min'), ('prevApps_installments_payments_DAYS_INSTALMENT_max_sum', 'sum')], 'installments_payments_AMT_INSTALMENT_min': [('prevApps_installments_payments_AMT_INSTALMENT_min_count', 'count'), ('prevApps_installments_payments_AMT_INSTALMENT_min_max', 'max'), ('prevApps_installments_payments_AMT_INSTALMENT_min_min', 'min'), ('prevApps_installments_payments_AMT_INSTALMENT_min_sum', 'sum')], 'installments_payments_AMT_INSTALMENT_max': [('prevApps_installments_payments_AMT_INSTALMENT_max_count', 'count'), ('prevApps_installments_payments_AMT_INSTALMENT_max_max', 'max'), ('prevApps_installments_payments_AMT_INSTALMENT_max_min', 'min'), ('prevApps_installments_payments_AMT_INSTALMENT_max_sum', 'sum')], 'credit_card_balance_AMT_BALANCE_min': [('prevApps_credit_card_balance_AMT_BALANCE_min_count', 'count'), ('prevApps_credit_card_balance_AMT_BALANCE_min_max', 'max'), 
('prevApps_credit_card_balance_AMT_BALANCE_min_min', 'min'), ('prevApps_credit_card_balance_AMT_BALANCE_min_sum', 'sum')], 'credit_card_balance_AMT_BALANCE_max': [('prevApps_credit_card_balance_AMT_BALANCE_max_count', 'count'), ('prevApps_credit_card_balance_AMT_BALANCE_max_max', 'max'), ('prevApps_credit_card_balance_AMT_BALANCE_max_min', 'min'), ('prevApps_credit_card_balance_AMT_BALANCE_max_sum', 'sum')], 'credit_card_balance_AMT_DRAWINGS_CURRENT_min': [('prevApps_credit_card_balance_AMT_DRAWINGS_CURRENT_min_count', 'count'), ('prevApps_credit_card_balance_AMT_DRAWINGS_CURRENT_min_max', 'max'), ('prevApps_credit_card_balance_AMT_DRAWINGS_CURRENT_min_min', 'min'), ('prevApps_credit_card_balance_AMT_DRAWINGS_CURRENT_min_sum', 'sum')], 'credit_card_balance_AMT_DRAWINGS_CURRENT_max': [('prevApps_credit_card_balance_AMT_DRAWINGS_CURRENT_max_count', 'count'), ('prevApps_credit_card_balance_AMT_DRAWINGS_CURRENT_max_max', 'max'), ('prevApps_credit_card_balance_AMT_DRAWINGS_CURRENT_max_min', 'min'), ('prevApps_credit_card_balance_AMT_DRAWINGS_CURRENT_max_sum', 'sum')], 'credit_card_balance_payment_diff_curr_pay_min': [('prevApps_credit_card_balance_payment_diff_curr_pay_min_count', 'count'), ('prevApps_credit_card_balance_payment_diff_curr_pay_min_max', 'max'), ('prevApps_credit_card_balance_payment_diff_curr_pay_min_min', 'min'), ('prevApps_credit_card_balance_payment_diff_curr_pay_min_sum', 'sum')], 'credit_card_balance_payment_diff_curr_pay_max': [('prevApps_credit_card_balance_payment_diff_curr_pay_max_count', 'count'), ('prevApps_credit_card_balance_payment_diff_curr_pay_max_max', 'max'), ('prevApps_credit_card_balance_payment_diff_curr_pay_max_min', 'min'), ('prevApps_credit_card_balance_payment_diff_curr_pay_max_sum', 'sum')], 'credit_card_balance_payment_diff_min_pay_min': [('prevApps_credit_card_balance_payment_diff_min_pay_min_count', 'count'), ('prevApps_credit_card_balance_payment_diff_min_pay_min_max', 'max'), 
('prevApps_credit_card_balance_payment_diff_min_pay_min_min', 'min'), ('prevApps_credit_card_balance_payment_diff_min_pay_min_sum', 'sum')], 'credit_card_balance_payment_diff_min_pay_max': [('prevApps_credit_card_balance_payment_diff_min_pay_max_count', 'count'), ('prevApps_credit_card_balance_payment_diff_min_pay_max_max', 'max'), ('prevApps_credit_card_balance_payment_diff_min_pay_max_min', 'min'), ('prevApps_credit_card_balance_payment_diff_min_pay_max_sum', 'sum')], 'POS_CASH_balance_MONTHS_BALANCE_min': [('prevApps_POS_CASH_balance_MONTHS_BALANCE_min_count', 'count'), ('prevApps_POS_CASH_balance_MONTHS_BALANCE_min_max', 'max'), ('prevApps_POS_CASH_balance_MONTHS_BALANCE_min_min', 'min'), ('prevApps_POS_CASH_balance_MONTHS_BALANCE_min_sum', 'sum')], 'POS_CASH_balance_MONTHS_BALANCE_max': [('prevApps_POS_CASH_balance_MONTHS_BALANCE_max_count', 'count'), ('prevApps_POS_CASH_balance_MONTHS_BALANCE_max_max', 'max'), ('prevApps_POS_CASH_balance_MONTHS_BALANCE_max_min', 'min'), ('prevApps_POS_CASH_balance_MONTHS_BALANCE_max_sum', 'sum')], 'POS_CASH_balance_CNT_INSTALMENT_min': [('prevApps_POS_CASH_balance_CNT_INSTALMENT_min_count', 'count'), ('prevApps_POS_CASH_balance_CNT_INSTALMENT_min_max', 'max'), ('prevApps_POS_CASH_balance_CNT_INSTALMENT_min_min', 'min'), ('prevApps_POS_CASH_balance_CNT_INSTALMENT_min_sum', 'sum')], 'POS_CASH_balance_CNT_INSTALMENT_max': [('prevApps_POS_CASH_balance_CNT_INSTALMENT_max_count', 'count'), ('prevApps_POS_CASH_balance_CNT_INSTALMENT_max_max', 'max'), ('prevApps_POS_CASH_balance_CNT_INSTALMENT_max_min', 'min'), ('prevApps_POS_CASH_balance_CNT_INSTALMENT_max_sum', 'sum')], 'POS_CASH_balance_CNT_INSTALMENT_FUTURE_min': [('prevApps_POS_CASH_balance_CNT_INSTALMENT_FUTURE_min_count', 'count'), ('prevApps_POS_CASH_balance_CNT_INSTALMENT_FUTURE_min_max', 'max'), ('prevApps_POS_CASH_balance_CNT_INSTALMENT_FUTURE_min_min', 'min'), ('prevApps_POS_CASH_balance_CNT_INSTALMENT_FUTURE_min_sum', 'sum')], 
'POS_CASH_balance_CNT_INSTALMENT_FUTURE_max': [('prevApps_POS_CASH_balance_CNT_INSTALMENT_FUTURE_max_count', 'count'), ('prevApps_POS_CASH_balance_CNT_INSTALMENT_FUTURE_max_max', 'max'), ('prevApps_POS_CASH_balance_CNT_INSTALMENT_FUTURE_max_min', 'min'), ('prevApps_POS_CASH_balance_CNT_INSTALMENT_FUTURE_max_sum', 'sum')]}
{'AMT_CREDIT_SUM': [('bureau_AMT_CREDIT_SUM_count', 'count'), ('bureau_AMT_CREDIT_SUM_max', 'max'), ('bureau_AMT_CREDIT_SUM_min', 'min'), ('bureau_AMT_CREDIT_SUM_sum', 'sum')], 'bureau_balance_MONTHS_BALANCE_min': [('bureau_bureau_balance_MONTHS_BALANCE_min_count', 'count'), ('bureau_bureau_balance_MONTHS_BALANCE_min_max', 'max'), ('bureau_bureau_balance_MONTHS_BALANCE_min_min', 'min'), ('bureau_bureau_balance_MONTHS_BALANCE_min_sum', 'sum')], 'bureau_balance_MONTHS_BALANCE_max': [('bureau_bureau_balance_MONTHS_BALANCE_max_count', 'count'), ('bureau_bureau_balance_MONTHS_BALANCE_max_max', 'max'), ('bureau_bureau_balance_MONTHS_BALANCE_max_min', 'min'), ('bureau_bureau_balance_MONTHS_BALANCE_max_sum', 'sum')]}
prevApps_aggregated = prevApps_feature_pipeline.fit_transform(prevAppsDF)
bureau_aggregated = bureau_feature_pipeline.fit_transform(bureauDF)
Average and Range¶prevApps_aggregated['prevApps_AMT_APPLICATION_avg'] = (
prevApps_aggregated['prevApps_AMT_APPLICATION_sum'] / prevApps_aggregated['prevApps_AMT_APPLICATION_count'] ).replace(np.inf, 0)
prevApps_aggregated['prevApps_AMT_APPLICATION_range'] = (
prevApps_aggregated['prevApps_AMT_APPLICATION_max'] - prevApps_aggregated['prevApps_AMT_APPLICATION_min'] ).replace(np.inf, 0)
bureau_aggregated['bureau_AMT_CREDIT_SUM_avg'] = (
bureau_aggregated['bureau_AMT_CREDIT_SUM_sum'] / bureau_aggregated['bureau_AMT_CREDIT_SUM_count'] ).replace(np.inf, 0)
bureau_aggregated['bureau_AMT_APPLICATION_range'] = (
bureau_aggregated['bureau_AMT_CREDIT_SUM_max'] - bureau_aggregated['bureau_AMT_CREDIT_SUM_min'] ).replace(np.inf, 0)
#prevApps_aggregated.info
num_attributes = prevApps_aggregated.select_dtypes(include=['int64', 'float64']).columns
print(num_attributes)
cat_attributes = prevApps_aggregated.select_dtypes(exclude=['int64', 'float64']).columns
print(cat_attributes)
print('----'*15)
print('Total number of features in feature set 1 - ',(len(num_attributes) + len(cat_attributes)))
print('----'*15)
print('Number of numerical attributes - ',len(num_attributes))
print('Number of categorical attributes - ',len(cat_attributes))
Index(['SK_ID_CURR', 'prevApps_AMT_APPLICATION_count',
'prevApps_AMT_APPLICATION_max', 'prevApps_AMT_APPLICATION_min',
'prevApps_AMT_APPLICATION_sum', 'prevApps_AMT_CREDIT_count',
'prevApps_AMT_CREDIT_max', 'prevApps_AMT_CREDIT_min',
'prevApps_AMT_CREDIT_sum', 'prevApps_AMT_ANNUITY_count',
'prevApps_AMT_ANNUITY_max', 'prevApps_AMT_ANNUITY_min',
'prevApps_AMT_ANNUITY_sum',
'prevApps_installments_payments_DAYS_INSTALMENT_min_count',
'prevApps_installments_payments_DAYS_INSTALMENT_min_max',
'prevApps_installments_payments_DAYS_INSTALMENT_min_min',
'prevApps_installments_payments_DAYS_INSTALMENT_min_sum',
'prevApps_installments_payments_DAYS_INSTALMENT_max_count',
'prevApps_installments_payments_DAYS_INSTALMENT_max_max',
'prevApps_installments_payments_DAYS_INSTALMENT_max_min',
'prevApps_installments_payments_DAYS_INSTALMENT_max_sum',
'prevApps_installments_payments_AMT_INSTALMENT_min_count',
'prevApps_installments_payments_AMT_INSTALMENT_min_max',
'prevApps_installments_payments_AMT_INSTALMENT_min_min',
'prevApps_installments_payments_AMT_INSTALMENT_min_sum',
'prevApps_installments_payments_AMT_INSTALMENT_max_count',
'prevApps_installments_payments_AMT_INSTALMENT_max_max',
'prevApps_installments_payments_AMT_INSTALMENT_max_min',
'prevApps_installments_payments_AMT_INSTALMENT_max_sum',
'prevApps_credit_card_balance_AMT_BALANCE_min_count',
'prevApps_credit_card_balance_AMT_BALANCE_min_max',
'prevApps_credit_card_balance_AMT_BALANCE_min_min',
'prevApps_credit_card_balance_AMT_BALANCE_min_sum',
'prevApps_credit_card_balance_AMT_BALANCE_max_count',
'prevApps_credit_card_balance_AMT_BALANCE_max_max',
'prevApps_credit_card_balance_AMT_BALANCE_max_min',
'prevApps_credit_card_balance_AMT_BALANCE_max_sum',
'prevApps_credit_card_balance_AMT_DRAWINGS_CURRENT_min_count',
'prevApps_credit_card_balance_AMT_DRAWINGS_CURRENT_min_max',
'prevApps_credit_card_balance_AMT_DRAWINGS_CURRENT_min_min',
'prevApps_credit_card_balance_AMT_DRAWINGS_CURRENT_min_sum',
'prevApps_credit_card_balance_AMT_DRAWINGS_CURRENT_max_count',
'prevApps_credit_card_balance_AMT_DRAWINGS_CURRENT_max_max',
'prevApps_credit_card_balance_AMT_DRAWINGS_CURRENT_max_min',
'prevApps_credit_card_balance_AMT_DRAWINGS_CURRENT_max_sum',
'prevApps_credit_card_balance_payment_diff_curr_pay_min_count',
'prevApps_credit_card_balance_payment_diff_curr_pay_min_max',
'prevApps_credit_card_balance_payment_diff_curr_pay_min_min',
'prevApps_credit_card_balance_payment_diff_curr_pay_min_sum',
'prevApps_credit_card_balance_payment_diff_curr_pay_max_count',
'prevApps_credit_card_balance_payment_diff_curr_pay_max_max',
'prevApps_credit_card_balance_payment_diff_curr_pay_max_min',
'prevApps_credit_card_balance_payment_diff_curr_pay_max_sum',
'prevApps_credit_card_balance_payment_diff_min_pay_min_count',
'prevApps_credit_card_balance_payment_diff_min_pay_min_max',
'prevApps_credit_card_balance_payment_diff_min_pay_min_min',
'prevApps_credit_card_balance_payment_diff_min_pay_min_sum',
'prevApps_credit_card_balance_payment_diff_min_pay_max_count',
'prevApps_credit_card_balance_payment_diff_min_pay_max_max',
'prevApps_credit_card_balance_payment_diff_min_pay_max_min',
'prevApps_credit_card_balance_payment_diff_min_pay_max_sum',
'prevApps_POS_CASH_balance_MONTHS_BALANCE_min_count',
'prevApps_POS_CASH_balance_MONTHS_BALANCE_min_max',
'prevApps_POS_CASH_balance_MONTHS_BALANCE_min_min',
'prevApps_POS_CASH_balance_MONTHS_BALANCE_min_sum',
'prevApps_POS_CASH_balance_MONTHS_BALANCE_max_count',
'prevApps_POS_CASH_balance_MONTHS_BALANCE_max_max',
'prevApps_POS_CASH_balance_MONTHS_BALANCE_max_min',
'prevApps_POS_CASH_balance_MONTHS_BALANCE_max_sum',
'prevApps_POS_CASH_balance_CNT_INSTALMENT_min_count',
'prevApps_POS_CASH_balance_CNT_INSTALMENT_min_max',
'prevApps_POS_CASH_balance_CNT_INSTALMENT_min_min',
'prevApps_POS_CASH_balance_CNT_INSTALMENT_min_sum',
'prevApps_POS_CASH_balance_CNT_INSTALMENT_max_count',
'prevApps_POS_CASH_balance_CNT_INSTALMENT_max_max',
'prevApps_POS_CASH_balance_CNT_INSTALMENT_max_min',
'prevApps_POS_CASH_balance_CNT_INSTALMENT_max_sum',
'prevApps_POS_CASH_balance_CNT_INSTALMENT_FUTURE_min_count',
'prevApps_POS_CASH_balance_CNT_INSTALMENT_FUTURE_min_max',
'prevApps_POS_CASH_balance_CNT_INSTALMENT_FUTURE_min_min',
'prevApps_POS_CASH_balance_CNT_INSTALMENT_FUTURE_min_sum',
'prevApps_POS_CASH_balance_CNT_INSTALMENT_FUTURE_max_count',
'prevApps_POS_CASH_balance_CNT_INSTALMENT_FUTURE_max_max',
'prevApps_POS_CASH_balance_CNT_INSTALMENT_FUTURE_max_min',
'prevApps_POS_CASH_balance_CNT_INSTALMENT_FUTURE_max_sum',
'prevApps_AMT_APPLICATION_avg', 'prevApps_AMT_APPLICATION_range'],
dtype='object')
Index([], dtype='object')
------------------------------------------------------------
Total number of features in feature set 1 - 87
------------------------------------------------------------
Number of numerical attributes - 87
Number of categorical attributes - 0
merge_all_data = True
appsTrainDF = datasets["application_train"]
X_kaggle_test = datasets["application_test"]
if merge_all_data:
# 1. Join/Merge in prevApps Data
appsTrainDF = appsTrainDF.merge(prevApps_aggregated, how='left', on='SK_ID_CURR')
X_kaggle_test = X_kaggle_test.merge(prevApps_aggregated, how='left', on='SK_ID_CURR')
# 2. Join/Merge in bureau Data
appsTrainDF = appsTrainDF.merge(bureau_aggregated, how='left', on="SK_ID_CURR")
X_kaggle_test = X_kaggle_test.merge(bureau_aggregated, how='left', on="SK_ID_CURR")
Percentages¶Days employed percentage = number of days employed / number of days lived
Credit income percentage = credit amount / total income
Annuity income percentage = Annuity amount / total income
# Training dataset
appsTrainDF['DAYS_EMPLOYED_PCT'] = appsTrainDF['DAYS_EMPLOYED'] / appsTrainDF['DAYS_BIRTH']
appsTrainDF['CREDIT_INCOME_PCT'] = appsTrainDF['AMT_CREDIT'] / appsTrainDF['AMT_INCOME_TOTAL']
appsTrainDF['ANNUITY_INCOME_PCT'] = appsTrainDF['AMT_ANNUITY'] / appsTrainDF['AMT_INCOME_TOTAL']
appsTrainDF['CREDIT_TERM'] = appsTrainDF['AMT_ANNUITY'] / appsTrainDF['AMT_CREDIT']
# Test dataset
X_kaggle_test['DAYS_EMPLOYED_PCT'] = X_kaggle_test['DAYS_EMPLOYED'] / X_kaggle_test['DAYS_BIRTH']
X_kaggle_test['CREDIT_INCOME_PCT'] = X_kaggle_test['AMT_CREDIT'] / X_kaggle_test['AMT_INCOME_TOTAL']
X_kaggle_test['ANNUITY_INCOME_PCT'] = X_kaggle_test['AMT_ANNUITY'] / X_kaggle_test['AMT_INCOME_TOTAL']
X_kaggle_test['CREDIT_TERM'] = X_kaggle_test['AMT_ANNUITY'] / X_kaggle_test['AMT_CREDIT']
appsTrainDF[prevApps_aggregated.columns] = appsTrainDF[prevApps_aggregated.columns].fillna(0)
X_kaggle_test[prevApps_aggregated.columns] = X_kaggle_test[prevApps_aggregated.columns].fillna(0)
appsTrainDF[bureau_aggregated.columns] = appsTrainDF[bureau_aggregated.columns].fillna(0)
X_kaggle_test[bureau_aggregated.columns] = X_kaggle_test[bureau_aggregated.columns].fillna(0)
# Create aggregate features (via pipeline)
class polynomialFeatureAdder(BaseEstimator, TransformerMixin):
def __init__(self, features=None, degree=4): # no *args or **kargs
self.features = features
self.polynomial_degree = degree
def fit(self, X, y=None):
return self
def fit_transform(self, X, y=None):
# print("X type from fit_transform",type(X))
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
data = X[self.features]
data_imputed = imp_mean.fit_transform(data)
data = pd.DataFrame(data_imputed, columns=self.features)
# print("imputed data : /n", data)
poly_pipeline = Pipeline([
("poly_transformer",PolynomialFeatures(degree = self.polynomial_degree))
])
poly_n_features = poly_pipeline.fit_transform(data, y)
poly_n_feature_names = poly_pipeline.get_params().get('poly_transformer').get_feature_names()
poly_df_train = pd.DataFrame(poly_n_features, columns= poly_n_feature_names)
return poly_df_train # return dataframe with polynomial features
Adding polynomial features for EXT_SOURCE_1, EXT_SOURCE_2, EXT_SOURCE_3, DAYS_BIRTH
from sklearn.preprocessing import PolynomialFeatures
poly_features = [ 'EXT_SOURCE_1','EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH']
polynomial_features_pipeline = Pipeline([
('poly_adder',polynomialFeatureAdder(poly_features, 4))
])
polyDF = datasets['application_train']
polyDF[poly_features] = polyDF[poly_features].fillna(0)
polynomial_df_train = polynomial_features_pipeline.fit_transform(polyDF)
polynomial_df_train.head(1)
| 1 | x0 | x1 | x2 | x3 | x0^2 | x0 x1 | x0 x2 | x0 x3 | x1^2 | x1 x2 | x1 x3 | x2^2 | x2 x3 | x3^2 | x0^3 | x0^2 x1 | x0^2 x2 | x0^2 x3 | x0 x1^2 | x0 x1 x2 | x0 x1 x3 | x0 x2^2 | x0 x2 x3 | x0 x3^2 | x1^3 | x1^2 x2 | x1^2 x3 | x1 x2^2 | x1 x2 x3 | x1 x3^2 | x2^3 | x2^2 x3 | x2 x3^2 | x3^3 | x0^4 | x0^3 x1 | x0^3 x2 | x0^3 x3 | x0^2 x1^2 | x0^2 x1 x2 | x0^2 x1 x3 | x0^2 x2^2 | x0^2 x2 x3 | x0^2 x3^2 | x0 x1^3 | x0 x1^2 x2 | x0 x1^2 x3 | x0 x1 x2^2 | x0 x1 x2 x3 | x0 x1 x3^2 | x0 x2^3 | x0 x2^2 x3 | x0 x2 x3^2 | x0 x3^3 | x1^4 | x1^3 x2 | x1^3 x3 | x1^2 x2^2 | x1^2 x2 x3 | x1^2 x3^2 | x1 x2^3 | x1 x2^2 x3 | x1 x2 x3^2 | x1 x3^3 | x2^4 | x2^3 x3 | x2^2 x3^2 | x2 x3^3 | x3^4 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.083037 | 0.262949 | 0.139376 | -9461.0 | 0.006895 | 0.021834 | 0.011573 | -785.612748 | 0.069142 | 0.036649 | -2487.756636 | 0.019426 | -1318.634256 | 89510521.0 | 0.000573 | 0.001813 | 0.000961 | -65.2349 | 0.005741 | 0.003043 | -206.575767 | 0.001613 | -109.49539 | 7.432682e+06 | 0.018181 | 0.009637 | -654.152107 | 0.005108 | -346.733022 | 2.353667e+07 | 0.002707 | -183.785678 | 1.247560e+07 | -8.468590e+11 | 0.000048 | 0.000151 | 0.00008 | -5.416908 | 0.000477 | 0.000253 | -17.153425 | 0.000134 | -9.092165 | 617187.390589 | 0.00151 | 0.0008 | -54.318807 | 0.000424 | -28.791659 | 1.954413e+06 | 0.000225 | -15.261005 | 1.035936e+06 | -7.032061e+10 | 0.004781 | 0.002534 | -172.008376 | 0.001343 | -91.17296 | 6.188933e+06 | 0.000712 | -48.326185 | 3.280441e+06 | -2.226804e+11 | 0.000377 | -25.615272 | 1.738796e+06 | -1.180316e+11 | 8.012133e+15 |
np.unique(polyDF['EXT_SOURCE_1'])
array([0. , 0.01456813, 0.01469148, ..., 0.94764939, 0.95162396,
0.96269277])
poly_features = ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH']
polynomial_features_pipeline = Pipeline([
('poly_adder',polynomialFeatureAdder(poly_features, 4))
])
polyDF = datasets['application_test']
polyDF[poly_features] = polyDF[poly_features].fillna(0)
polynomial_df_test = polynomial_features_pipeline.fit_transform(polyDF)
polynomial_df_train.index = datasets['application_train'].index
polynomial_df_test.index = datasets['application_test'].index
polynomial_df_test.head(1)
| 1 | x0 | x1 | x2 | x3 | x0^2 | x0 x1 | x0 x2 | x0 x3 | x1^2 | x1 x2 | x1 x3 | x2^2 | x2 x3 | x3^2 | x0^3 | x0^2 x1 | x0^2 x2 | x0^2 x3 | x0 x1^2 | x0 x1 x2 | x0 x1 x3 | x0 x2^2 | x0 x2 x3 | x0 x3^2 | x1^3 | x1^2 x2 | x1^2 x3 | x1 x2^2 | x1 x2 x3 | x1 x3^2 | x2^3 | x2^2 x3 | x2 x3^2 | x3^3 | x0^4 | x0^3 x1 | x0^3 x2 | x0^3 x3 | x0^2 x1^2 | x0^2 x1 x2 | x0^2 x1 x3 | x0^2 x2^2 | x0^2 x2 x3 | x0^2 x3^2 | x0 x1^3 | x0 x1^2 x2 | x0 x1^2 x3 | x0 x1 x2^2 | x0 x1 x2 x3 | x0 x1 x3^2 | x0 x2^3 | x0 x2^2 x3 | x0 x2 x3^2 | x0 x3^3 | x1^4 | x1^3 x2 | x1^3 x3 | x1^2 x2^2 | x1^2 x2 x3 | x1^2 x3^2 | x1 x2^3 | x1 x2^2 x3 | x1 x2 x3^2 | x1 x3^3 | x2^4 | x2^3 x3 | x2^2 x3^2 | x2 x3^3 | x3^4 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.752614 | 0.789654 | 0.15952 | -19241.0 | 0.566429 | 0.594305 | 0.120057 | -14481.055414 | 0.623554 | 0.125965 | -15193.73937 | 0.025446 | -3069.315478 | 370216081.0 | 0.426302 | 0.447283 | 0.090356 | -10898.652144 | 0.469296 | 0.094803 | -11435.028416 | 0.019151 | -2310.011305 | 2.786300e+08 | 0.492392 | 0.099469 | -11997.802403 | 0.020094 | -2423.698322 | 2.923427e+08 | 0.004059 | -489.615795 | 5.905670e+07 | -7.123328e+12 | 0.320841 | 0.336632 | 0.068004 | -8202.483531 | 0.353199 | 0.07135 | -8606.168086 | 0.014414 | -1738.547982 | 2.097010e+08 | 0.370581 | 0.074862 | -9029.719944 | 0.015123 | -1824.110478 | 2.200214e+08 | 0.003055 | -368.491942 | 4.444693e+07 | -5.361120e+12 | 0.38882 | 0.078546 | -9474.116872 | 0.015867 | -1913.883926 | 2.308497e+08 | 0.003205 | -386.627243 | 4.663438e+07 | -5.624967e+12 | 0.000648 | -78.103287 | 9.420698e+06 | -1.136310e+12 | 1.370599e+17 |
# axis=1: append the polynomial features as new columns (the default axis=0 would stack rows)
appsTrainDFpoly = pd.concat([appsTrainDF, polynomial_df_train], axis=1)
X_kaggle_test_poly = pd.concat([X_kaggle_test, polynomial_df_test], axis=1)
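The `axis=1` argument matters when appending engineered feature columns: with the default `axis=0`, `pandas.concat` stacks rows and fills the non-overlapping columns with NaN instead of widening the frame. A toy sketch (made-up column names, not the HCDR features):

```python
import pandas as pd

base = pd.DataFrame({"AMT_CREDIT": [100.0, 200.0]}, index=[10, 11])
poly = pd.DataFrame({"x0^2": [1.0, 4.0]}, index=[10, 11])

# axis=0 (the default) stacks rows and pads the missing columns with NaN
stacked = pd.concat([base, poly])
# axis=1 aligns on the index and appends the new feature columns
joined = pd.concat([base, poly], axis=1)

print(stacked.shape)  # (4, 2), half the cells NaN
print(joined.shape)   # (2, 2), no NaN
```

This is why sharing the index between `polynomial_df_train` and the application frame (as done above) is a prerequisite: `axis=1` concatenation aligns rows by index.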
appsTrainDF.select_dtypes(include=['int64', 'float64']).columns
Index(['SK_ID_CURR', 'TARGET', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL',
'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE',
'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH', 'DAYS_EMPLOYED',
...
'bureau_bureau_balance_MONTHS_BALANCE_max_count',
'bureau_bureau_balance_MONTHS_BALANCE_max_max',
'bureau_bureau_balance_MONTHS_BALANCE_max_min',
'bureau_bureau_balance_MONTHS_BALANCE_max_sum',
'bureau_AMT_CREDIT_SUM_avg', 'bureau_AMT_APPLICATION_range',
'DAYS_EMPLOYED_PCT', 'CREDIT_INCOME_PCT', 'ANNUITY_INCOME_PCT',
'CREDIT_TERM'],
dtype='object', length=213)
appsTrainDF.select_dtypes(exclude=['int64', 'float64']).columns
Index(['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY',
'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE',
'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE',
'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE', 'FONDKAPREMONT_MODE',
'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE'],
dtype='object')
appsTrainDF.dtypes.value_counts()
float64    172
int64       41
object      16
dtype: int64
correlation_with_all_features = appsTrainDF.corr()
correlation_with_all_features['TARGET'].sort_values()
EXT_SOURCE_3                          -0.178919
EXT_SOURCE_2                          -0.160472
EXT_SOURCE_1                          -0.155317
DAYS_EMPLOYED                         -0.044932
FLOORSMAX_AVG                         -0.044003
                                         ...
REGION_RATING_CLIENT                   0.058899
REGION_RATING_CLIENT_W_CITY            0.060893
DAYS_BIRTH                             0.078239
TARGET                                 1.000000
Name: TARGET, dtype: float64
(full listing over all 213 numeric features truncated)
# set this value to choose the number of positive and negative correlated features
n_val = 15
print("---"*15)
print("---"*15)
print(" Total correlation of all the features. " )
print("---"*15)
print("---"*15)
print(f"Top {n_val} negative correlated features")
print()
print(correlation_with_all_features.TARGET.sort_values(ascending = True).head(n_val))
print()
print()
print(f"Top {n_val} positive correlated features")
print()
print(correlation_with_all_features.TARGET.sort_values(ascending = True).tail(n_val))
---------------------------------------------
---------------------------------------------
Total correlation of all the features.
---------------------------------------------
---------------------------------------------
Top 15 negative correlated features
EXT_SOURCE_3 -0.178919
EXT_SOURCE_2 -0.160472
EXT_SOURCE_1 -0.155317
DAYS_EMPLOYED -0.044932
FLOORSMAX_AVG -0.044003
FLOORSMAX_MEDI -0.043768
FLOORSMAX_MODE -0.043226
AMT_GOODS_PRICE -0.039645
REGION_POPULATION_RELATIVE -0.037227
ELEVATORS_AVG -0.034199
ELEVATORS_MEDI -0.033863
FLOORSMIN_AVG -0.033614
FLOORSMIN_MEDI -0.033394
LIVINGAREA_AVG -0.032997
LIVINGAREA_MEDI -0.032739
Name: TARGET, dtype: float64
Top 15 positive correlated features
DEF_30_CNT_SOCIAL_CIRCLE 0.032248
LIVE_CITY_NOT_WORK_CITY 0.032518
OWN_CAR_AGE 0.037612
DAYS_REGISTRATION 0.041975
DAYS_EMPLOYED_PCT 0.042206
FLAG_DOCUMENT_3 0.044346
REG_CITY_NOT_LIVE_CITY 0.044395
FLAG_EMP_PHONE 0.045982
REG_CITY_NOT_WORK_CITY 0.050994
DAYS_ID_PUBLISH 0.051457
DAYS_LAST_PHONE_CHANGE 0.055218
REGION_RATING_CLIENT 0.058899
REGION_RATING_CLIENT_W_CITY 0.060893
DAYS_BIRTH 0.078239
TARGET 1.000000
Name: TARGET, dtype: float64
tf_apps_train_final = []
featureslist1 = correlation_with_all_features.TARGET.sort_values(ascending = True)[:n_val].index.tolist()
featureslist2 = correlation_with_all_features.TARGET.sort_values(ascending = True)[-n_val:].index.tolist()
tf_apps_train_final = featureslist1 + featureslist2
len(tf_apps_train_final)
30
for idx in tf_apps_train_final:
    print(f"{idx:50} {appsTrainDF[idx].dtypes}")
EXT_SOURCE_3                                       float64
EXT_SOURCE_2                                       float64
EXT_SOURCE_1                                       float64
DAYS_EMPLOYED                                      int64
FLOORSMAX_AVG                                      float64
FLOORSMAX_MEDI                                     float64
FLOORSMAX_MODE                                     float64
AMT_GOODS_PRICE                                    float64
REGION_POPULATION_RELATIVE                         float64
ELEVATORS_AVG                                      float64
ELEVATORS_MEDI                                     float64
FLOORSMIN_AVG                                      float64
FLOORSMIN_MEDI                                     float64
LIVINGAREA_AVG                                     float64
LIVINGAREA_MEDI                                    float64
DEF_30_CNT_SOCIAL_CIRCLE                           float64
LIVE_CITY_NOT_WORK_CITY                            int64
OWN_CAR_AGE                                        float64
DAYS_REGISTRATION                                  float64
DAYS_EMPLOYED_PCT                                  float64
FLAG_DOCUMENT_3                                    int64
REG_CITY_NOT_LIVE_CITY                             int64
FLAG_EMP_PHONE                                     int64
REG_CITY_NOT_WORK_CITY                             int64
DAYS_ID_PUBLISH                                    int64
DAYS_LAST_PHONE_CHANGE                             float64
REGION_RATING_CLIENT                               int64
REGION_RATING_CLIENT_W_CITY                        int64
DAYS_BIRTH                                         int64
TARGET                                             int64
modeling_num_attrib = []
modeling_cat_attrib = []
for idx in tf_apps_train_final:
    if appsTrainDF[idx].dtypes in ['int64', 'float64']:
        modeling_num_attrib.append(idx)
    else:
        modeling_cat_attrib.append(idx)
print('Number of numerical features - ',len(modeling_num_attrib))
print('Number of categorical features - ',len(modeling_cat_attrib))
Number of numerical features -  30
Number of categorical features -  0
We have selected the top 15 negatively and top 15 positively correlated features (30 in total) as candidates for building the baseline pipeline
X_train_merge = pd.concat([app_train.TARGET, df], axis=1)  # axis=1: attach TARGET as a column
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
# Get only numeric columns
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
newdf = X_train_merge.select_dtypes(include=numerics)
print("Feature data dimension: ", newdf.shape)
X = newdf.drop(columns=['TARGET'])  # drop the label explicitly, rather than the last column, to avoid leaking TARGET into X
y = X_train_merge['TARGET']
print(X.shape)
print(y.shape)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.10, random_state=42)
# f_classif rather than chi2: chi2 requires non-negative features, but several
# columns here (e.g. DAYS_BIRTH) are negative
kbest = SelectKBest(f_classif)
pipeline = Pipeline([('kbest', kbest), ('lr', LogisticRegression(max_iter=1000))])
#pipeline = Pipeline([('kbest', kbest), ('ada', AdaBoostClassifier())])
#grid_search = GridSearchCV(pipeline, {'kbest__k': [10, 20], 'ada__n_estimators': [10, 20], 'ada__learning_rate': [0.1, 1]}, cv=5, verbose=1)
# Grid keys must match the pipeline step names ('kbest', 'lr'):
grid_search = GridSearchCV(pipeline, {'kbest__k': [10, 20], 'lr__C': np.logspace(-4, 4, 5)}, cv=5, verbose=1)
feat_select = grid_search.fit(X_train, y_train)
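Once the search finishes, the fitted `GridSearchCV` object exposes the winning configuration. A self-contained sketch on synthetic data (`make_classification` here is a stand-in, not the HCDR features), again using `f_classif` since it tolerates negative feature values:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the real feature matrix
X_toy, y_toy = make_classification(n_samples=200, n_features=20, random_state=42)

pipe = Pipeline([("kbest", SelectKBest(f_classif)),
                 ("lr", LogisticRegression(max_iter=1000))])
# Grid keys are "<step name>__<param name>", matching the Pipeline step names
grid = GridSearchCV(pipe, {"kbest__k": [5, 10], "lr__C": [0.1, 1.0]}, cv=3)
grid.fit(X_toy, y_toy)

print(grid.best_params_)           # the winning parameter combination
print(round(grid.best_score_, 3))  # its mean cross-validated accuracy
```

`best_estimator_` is refit on the full training data with those parameters, so it can be used directly for prediction.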
from sklearn.base import BaseEstimator, TransformerMixin
import re
# Custom transformer that binarizes OCCUPATION_TYPE:
# 1.0 for the higher-skill occupation categories listed below, 0.0 otherwise
# (missing values also map to 0.0)
class prep_OCCUPATION_TYPE(BaseEstimator, TransformerMixin):
    def __init__(self, features=("OCCUPATION_TYPE",)):  # immutable default; no *args or **kargs
        self.features = features

    def fit(self, X, y=None):
        return self  # nothing else to do

    def transform(self, X):
        df = pd.DataFrame(X, columns=self.features)
        df['OCCUPATION_TYPE'] = df['OCCUPATION_TYPE'].apply(
            lambda x: 1. if x in ['Core Staff', 'Accountants', 'Managers', 'Sales Staff',
                                  'Medicine Staff', 'High Skill Tech Staff', 'Realty Agents',
                                  'IT Staff', 'HR Staff'] else 0.)
        return np.array(df.values)  # return a NumPy array to observe the pipeline protocol
from sklearn.pipeline import make_pipeline
features = ["OCCUPATION_TYPE"]
def test_driver_prep_OCCUPATION_TYPE():
    print(f"X_train.shape: {X_train.shape}\n")
    print(f"X_train['name'][0:5]: \n{X_train[features][0:5]}")
    test_pipeline = make_pipeline(prep_OCCUPATION_TYPE(features))
    return test_pipeline.fit_transform(X_train)
x = test_driver_prep_OCCUPATION_TYPE()
print(f"Test driver: \n{test_driver_prep_OCCUPATION_TYPE()[0:10, :]}")
print(f"X_train['name'][0:10]: \n{X_train[features][0:10]}")
# QUESTION, should we lower case df['OCCUPATION_TYPE'] as Sales staff != 'Sales Staff'? (hint: YES)
X_train.shape: (307511, 132)

X_train['name'][0:5]:
  OCCUPATION_TYPE
0        Laborers
1      Core staff
2        Laborers
3        Laborers
4      Core staff

X_train.shape: (307511, 132)

X_train['name'][0:5]:
  OCCUPATION_TYPE
0        Laborers
1      Core staff
2        Laborers
3        Laborers
4      Core staff

Test driver:
[[0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [1.]
 [1.]
 [0.]
 [0.]]
X_train['name'][0:10]:
  OCCUPATION_TYPE
0        Laborers
1      Core staff
2        Laborers
3        Laborers
4      Core staff
5        Laborers
6     Accountants
7        Managers
8             NaN
9        Laborers
# Create a class to select numerical or categorical columns
# since Scikit-Learn doesn't handle DataFrames yet
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.attribute_names].values
# Identify the numeric features we wish to consider.
num_attribs = [
'AMT_INCOME_TOTAL', 'AMT_CREDIT','DAYS_EMPLOYED','DAYS_BIRTH','EXT_SOURCE_1',
'EXT_SOURCE_2','EXT_SOURCE_3']
num_pipeline = Pipeline([
('selector', DataFrameSelector(num_attribs)),
('imputer', SimpleImputer(strategy='mean')),
('std_scaler', StandardScaler()),
])
# Identify the categorical features we wish to consider.
cat_attribs = ['CODE_GENDER', 'FLAG_OWN_REALTY','FLAG_OWN_CAR','NAME_CONTRACT_TYPE',
'NAME_EDUCATION_TYPE','OCCUPATION_TYPE','NAME_INCOME_TYPE']
# Note handle_unknown="ignore" in the OneHotEncoder, which ignores categories in the
# validation/test data that do NOT occur in the training set
cat_pipeline = Pipeline([
('selector', DataFrameSelector(cat_attribs)),
#('imputer', SimpleImputer(strategy='most_frequent')),
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))
])
data_prep_pipeline = FeatureUnion(transformer_list=[
("num_pipeline", num_pipeline),
("cat_pipeline", cat_pipeline),
])
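A minimal end-to-end sketch of the same select/impute/encode pattern using `ColumnTransformer`, the modern scikit-learn equivalent of `DataFrameSelector` + `FeatureUnion`, on a toy two-column frame (the column names are illustrative):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "AMT_CREDIT": [100.0, None, 300.0],
    "CODE_GENDER": ["F", "M", None],
})

# ColumnTransformer selects columns by name, removing the need for a custom selector
prep = ColumnTransformer([
    ("num", Pipeline([("imputer", SimpleImputer(strategy="mean")),
                      ("scale", StandardScaler())]), ["AMT_CREDIT"]),
    ("cat", Pipeline([("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
                      ("ohe", OneHotEncoder(handle_unknown="ignore"))]), ["CODE_GENDER"]),
])
out = prep.fit_transform(df)
print(out.shape)  # 1 scaled numeric column + 3 one-hot columns (F, M, missing)
```

The `FeatureUnion` version above works the same way; `ColumnTransformer` just folds the column selection into the transformer itself.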
list(datasets["application_train"].columns)
['SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'OWN_CAR_AGE', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'OCCUPATION_TYPE', 'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'ORGANIZATION_TYPE', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI', 'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'TOTALAREA_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE', 
'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR', 'ef_INCOME_CREDIT_PERCENT', 'ef_FAM_MEMBER_INCOME', 'ef_ANN_INCOME_PERCENT']
datasets["application_train"].columns
Index(['SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER',
'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL',
'AMT_CREDIT', 'AMT_ANNUITY',
...
'FLAG_DOCUMENT_21', 'AMT_REQ_CREDIT_BUREAU_HOUR',
'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK',
'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT',
'AMT_REQ_CREDIT_BUREAU_YEAR', 'ef_INCOME_CREDIT_PERCENT',
'ef_FAM_MEMBER_INCOME', 'ef_ANN_INCOME_PERCENT'],
dtype='object', length=125)
selected_features = num_attribs + cat_attribs
tot_features = f"{len(selected_features)}: Num:{len(num_attribs)}, Cat:{len(cat_attribs)}"
#Total Feature selected for processing
tot_features
'14: Num:7, Cat:7'
train_dataset=appsTrainDF
class_labels = ["No Default","Default"]
try:
    expLog
except NameError:
    expLog = pd.DataFrame(columns=["exp_name",
                                   "Train Acc",
                                   "Valid Acc",
                                   "Test Acc",
                                   "Train AUC",
                                   "Valid AUC",
                                   "Test AUC"])
To get a baseline, we will use a subset of the features after preprocessing them through the pipeline. The baseline model is a logistic regression.
def pct(x):
    return round(100 * x, 3)
# roc curve, precision recall curve for each model
fprs, tprs, precisions, recalls, names, scores, cvscores, pvalues, accuracy, cnfmatrix = list(), list(), list(), list(), list(), list(), list(), list(), list(), list()
features_list, final_best_clf,results = {}, {},[]
Submissions are evaluated on the area under the ROC curve between the predicted probability and the observed target.
scikit-learn's roc_auc_score function computes the area under the receiver operating characteristic (ROC) curve, also denoted AUC or AUROC. By computing the area under the ROC curve, the curve's information is summarized in a single number.
from sklearn.metrics import roc_auc_score
>>> y_true = np.array([0, 0, 1, 1])
>>> y_scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> roc_auc_score(y_true, y_scores)
0.75
Binary classification loss functions that can be used to calculate the loss/error and to update the feature weights accordingly include:
Cross-entropy loss - the default loss function for most binary classification problems. It is closely related to maximum likelihood: it scores the average difference between the actual and predicted probability distributions for the predicted class.
Hinge loss - used for SVM models. It checks that examples have the correct sign, assigning more error when the actual and predicted class values differ in sign.
Squared hinge loss - a popular extension of hinge loss that simply squares the hinge loss score.
We use the cross-entropy loss, which is the most suitable of these for this binary classification problem.
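The cross-entropy (log) loss can be computed directly from its definition; a quick sketch comparing a manual computation against scikit-learn's `log_loss` on the same toy labels used earlier:

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])

# Binary cross-entropy: mean of -[y*log(p) + (1-y)*log(1-p)]
manual = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

print(round(manual, 4))
print(round(log_loss(y_true, y_prob), 4))  # matches the manual computation
```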
Some metrics are sensitive to the class imbalance in this data, which should not be allowed to distort our view of model performance. Metric measures that remain helpful are -
Error = (FP + FN) / (TP + TN + FP + FN)
Precision = (TP) / (TP + FP)
Recall - Recall turns out to be a good measure of error on imbalanced data. It is calculated as -
Recall = (TP) / (TP + FN)
ROC (Receiver Operating Characteristic) - the ROC curve summarizes the tradeoff between the true positive rate and the false positive rate across classification thresholds. The AUC (Area Under the Curve) condenses the curve into a single number and remains informative under class imbalance, which makes it a natural headline metric here.
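The formulas above can be checked numerically; a small sketch deriving error, precision, and recall from a confusion matrix on toy labels (not the HCDR data) and confirming them against scikit-learn:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# ravel() flattens the 2x2 matrix into (TN, FP, FN, TP) for binary labels
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

error = (fp + fn) / (tp + tn + fp + fn)   # Error    = (FP + FN) / total
precision = tp / (tp + fp)                # Precision = TP / (TP + FP)
recall = tp / (tp + fn)                   # Recall    = TP / (TP + FN)
print(error, precision, recall)
```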
def precision_recall_cust(model, X_train, y_train, X_test, y_test, X_valid, y_valid, precisions, recalls, name):
    # precision/recall points on the test set
    precision, recall, threshold = precision_recall_curve(y_test, model.predict_proba(X_test)[:, 1])
    precisions.append(precision)
    recalls.append(recall)
    # plot combined precision-recall curves for train, valid, test
    show_train_precision = plot_precision_recall_curve(model, X_train, y_train, name="TrainPresRecal")
    show_test_precision = plot_precision_recall_curve(model, X_test, y_test, name="TestPresRecal", ax=show_train_precision.ax_)
    show_valid_precision = plot_precision_recall_curve(model, X_valid, y_valid, name="ValidPresRecal", ax=show_test_precision.ax_)
    show_valid_precision.ax_.set_title("Precision Recall Curve Comparison - " + name)
    plt.legend(bbox_to_anchor=(1.04, 1), loc="upper left", borderaxespad=0)
    plt.show()
    return precisions, recalls
def confusion_matrix_def(model, X_train, y_train, X_test, y_test, X_valid, y_valid, cnfmatrix):
    # predictions
    preds_test = model.predict(X_test)
    preds_train = model.predict(X_train)
    preds_valid = model.predict(X_valid)
    # row-normalize each confusion matrix so cells show per-class rates
    cm_train = confusion_matrix(y_train, preds_train).astype(np.float32)
    cm_train /= cm_train.sum(axis=1)[:, np.newaxis]
    cm_test = confusion_matrix(y_test, preds_test).astype(np.float32)
    cm_test /= cm_test.sum(axis=1)[:, np.newaxis]
    cm_valid = confusion_matrix(y_valid, preds_valid).astype(np.float32)
    cm_valid /= cm_valid.sum(axis=1)[:, np.newaxis]
    plt.figure(figsize=(16, 4))
    plt.subplot(131)
    g = sns.heatmap(cm_train, vmin=0, vmax=1, annot=True, cmap="Reds")
    plt.xlabel("Predicted", fontsize=14)
    plt.ylabel("True", fontsize=14)
    g.set(xticklabels=class_labels, yticklabels=class_labels)
    plt.title("Train", fontsize=14)
    plt.subplot(132)
    g = sns.heatmap(cm_valid, vmin=0, vmax=1, annot=True, cmap="Reds")
    plt.xlabel("Predicted", fontsize=14)
    plt.ylabel("True", fontsize=14)
    g.set(xticklabels=class_labels, yticklabels=class_labels)
    plt.title("Validation set", fontsize=14)
    plt.subplot(133)
    g = sns.heatmap(cm_test, vmin=0, vmax=1, annot=True, cmap="Reds")
    plt.xlabel("Predicted", fontsize=14)
    plt.ylabel("True", fontsize=14)
    g.set(xticklabels=class_labels, yticklabels=class_labels)
    plt.title("Test", fontsize=14)
    cnfmatrix.append(cm_test)
    return cnfmatrix
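The row normalization inside `confusion_matrix_def` turns raw counts into per-class rates (the diagonal then shows each class's recall). A tiny standalone sketch of that step on a made-up 2x2 matrix:

```python
import numpy as np

cm = np.array([[90.0, 10.0],
               [20.0, 30.0]])

# Dividing each row by its sum converts counts into per-class rates;
# each row of the result sums to 1
cm_norm = cm / cm.sum(axis=1)[:, np.newaxis]
print(cm_norm)  # [[0.9 0.1], [0.4 0.6]]
```

This is useful for imbalanced data like HCDR's, where raw counts would make the majority "No Default" class dominate the color scale.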
def roc_curve_cust(model, X_train, y_train, X_test, y_test, X_valid, y_valid, fprs, tprs, name):
    fpr, tpr, threshold = roc_curve(y_test, model.predict_proba(X_test)[:, 1])
    fprs.append(fpr)
    tprs.append(tpr)
    # plot combined ROC curves for train, valid, test
    show_train_roc = plot_roc_curve(model, X_train, y_train, name="TrainRocAuc")
    show_test_roc = plot_roc_curve(model, X_test, y_test, name="TestRocAuc", ax=show_train_roc.ax_)
    show_valid_roc = plot_roc_curve(model, X_valid, y_valid, name="ValidRocAuc", ax=show_test_roc.ax_)
    show_valid_roc.ax_.set_title("ROC Curve Comparison - " + name)
    plt.legend(bbox_to_anchor=(1.04, 1), loc="upper left", borderaxespad=0)
    plt.show()
    return fprs, tprs
# Keep only the selected features that actually exist in the training frame.
# (Removing items from a list while iterating over it skips elements, so we
# rebuild the list with a comprehension instead.)
selected_features = [col for col in selected_features if col in train_dataset.columns]
# Subsample: split the data into `splits` chunks and feed the pipeline the first one,
# i.e. (1 / splits) of the full dataset
splits = 50
# Fraction held out by the train/test split
subsample_rate = 0.3
finaldf = np.array_split(train_dataset, splits)
X_train = finaldf[0][selected_features]
y_train = finaldf[0]['TARGET']
X_kaggle_test= X_kaggle_test[selected_features]
## split part of data
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, stratify=y_train,
test_size=subsample_rate, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train,stratify=y_train,test_size=0.15, random_state=42)
print(f"X train shape: {X_train.shape}")
print(f"X validation shape: {X_valid.shape}")
print(f"X test shape: {X_test.shape}")
print(f"X kaggle_test shape: {X_kaggle_test.shape}")
X train shape: (3659, 14)
X validation shape: (646, 14)
X test shape: (1846, 14)
X kaggle_test shape: (48744, 14)
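As a sanity check on `stratify=y`, a toy sketch with synthetic labels (roughly the competition's ~8% default rate, an assumption for illustration) showing that the positive rate is preserved across the splits:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy target: 92 negatives, 8 positives
y = np.array([0] * 92 + [1] * 8)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y,
                                          test_size=0.25, random_state=42)

# stratify=y keeps the positive rate (almost exactly) equal in both splits
print(y_tr.mean(), y_te.mean())
```

Without stratification, a small test split could easily end up with far fewer (or more) defaults than the training split, distorting every threshold-based metric.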
%%time
np.random.seed(42)
full_pipeline_with_predictor = Pipeline([
("preparation", data_prep_pipeline),
("linear", LogisticRegression())
])
model = full_pipeline_with_predictor.fit(X_train, y_train)
CPU times: user 212 ms, sys: 155 ms, total: 367 ms Wall time: 602 ms
from time import time, ctime
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_validate
from sklearn.utils import resample
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
#from lightgbm import LGBMClassifier
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, log_loss, classification_report, roc_auc_score, make_scorer
from scipy import stats
import json
from matplotlib import pyplot
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, make_scorer, roc_curve, ConfusionMatrixDisplay, precision_recall_curve
from sklearn.metrics import explained_variance_score
from sklearn.metrics import plot_roc_curve, plot_confusion_matrix, plot_precision_recall_curve
cvSplits = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
metrics = {'accuracy': make_scorer(accuracy_score),
'roc_auc': 'roc_auc',
'f1': make_scorer(f1_score),
'log_loss': make_scorer(log_loss, needs_proba=True, greater_is_better=False)  # log_loss scores probabilities and is lower-is-better
}
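scikit-learn also ships builtin scorer strings that handle `predict_proba` and sign conventions automatically; a sketch on synthetic data (`make_classification` is a stand-in, not the HCDR features) using `'neg_log_loss'` alongside `'roc_auc'`:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X_toy, y_toy = make_classification(n_samples=200, random_state=0)

# Builtin scorer strings: 'roc_auc' and 'neg_log_loss' both use predict_proba
# internally; 'neg_log_loss' is negated so that higher is always better
scores = cross_validate(LogisticRegression(max_iter=1000), X_toy, y_toy, cv=3,
                        scoring={"roc_auc": "roc_auc", "log_loss": "neg_log_loss"})
print(scores["test_roc_auc"].mean())
print(-scores["test_log_loss"].mean())  # flip the sign back to a raw log loss
```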
start = time()
model = full_pipeline_with_predictor.fit(X_train, y_train)
np.random.seed(42)
# Set up cross validation scores
logit_scores = cross_validate(model, X_train, y_train,cv=cvSplits,scoring=metrics, return_train_score=True, n_jobs=-1)
train_time = np.round(time() - start, 4)
# Time and score valid predictions
start = time()
logit_score_valid = full_pipeline_with_predictor.score(X_valid, y_valid)
valid_time = np.round(time() - start, 4)
# Time and score test predictions
start = time()
logit_score_test = full_pipeline_with_predictor.score(X_test, y_test)
test_time = np.round(time() - start, 4)
from sklearn.metrics import accuracy_score
np.round(accuracy_score(y_train, model.predict(X_train)), 3)
0.921
exp_name = f"Baseline_{len(selected_features)}_features"
expLog.loc[len(expLog)] = [f"{exp_name}"] + list(np.round(
[accuracy_score(y_train, model.predict(X_train)),
accuracy_score(y_valid, model.predict(X_valid)),
accuracy_score(y_test, model.predict(X_test)),
roc_auc_score(y_train, model.predict_proba(X_train)[:, 1]),
roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1]),
roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])],
4))
expLog
|   | exp_name | Train Acc | Valid Acc | Test Acc | Train AUC | Valid AUC | Test AUC |
|---|---|---|---|---|---|---|---|
| 0 | Baseline_14_features | 0.921 | 0.9195 | 0.9209 | 0.7725 | 0.8078 | 0.6971 |
_,_=roc_curve_cust(model,X_train,y_train,X_test, y_test,X_valid, y_valid,fprs,tprs,"Baseline Logistic Regression Model")
_=confusion_matrix_def(model,X_train,y_train,X_test,y_test,X_valid, y_valid,cnfmatrix)
from sklearn.metrics import roc_auc_score
roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
0.7724581162449694
_,_=precision_recall_cust(model,X_train,y_train,X_test, y_test,X_valid, y_valid,precisions,recalls,"Baseline Logistic Regression Model")
Resampling minority class
train_data = pd.concat([X_train, y_train], axis=1)
train_data.head()
|   | AMT_INCOME_TOTAL | AMT_CREDIT | DAYS_EMPLOYED | DAYS_BIRTH | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | CODE_GENDER | FLAG_OWN_REALTY | FLAG_OWN_CAR | NAME_CONTRACT_TYPE | NAME_EDUCATION_TYPE | OCCUPATION_TYPE | NAME_INCOME_TYPE | TARGET |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5362 | 157500.0 | 312768.0 | -1093 | -14502 | 0.382308 | 0.207453 | 0.148254 | F | Y | N | Cash loans | Secondary / secondary special | Laborers | Working | 0 |
| 5150 | 112500.0 | 397881.0 | -2201 | -14875 | NaN | 0.082266 | 0.729567 | F | Y | N | Cash loans | Secondary / secondary special | Sales staff | Working | 0 |
| 5841 | 51750.0 | 135000.0 | 365243 | -20383 | NaN | 0.591825 | NaN | F | Y | N | Revolving loans | Secondary / secondary special | NaN | Pensioner | 0 |
| 2399 | 171000.0 | 675000.0 | -3313 | -19026 | 0.605392 | 0.572937 | 0.508287 | F | Y | N | Cash loans | Secondary / secondary special | NaN | Working | 0 |
| 2564 | 157500.0 | 505642.5 | -4028 | -15010 | 0.729739 | 0.706836 | 0.440058 | F | Y | N | Cash loans | Secondary / secondary special | Laborers | Commercial associate | 0 |
no_default_data = train_data[train_data.TARGET==0]
default_data = train_data[train_data.TARGET==1]
# sample minority
default_sampled_data = resample(default_data,
replace=True, # sample with replacement
n_samples=len(no_default_data), # match number in majority class
random_state=42) # reproducible
# combine majority and upsampled minority
train_data = pd.concat([no_default_data, default_sampled_data])
train_data.TARGET.value_counts()
0    3374
1    3374
Name: TARGET, dtype: int64
y_train = train_data['TARGET']
X_train = train_data[selected_features]
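The upsampling step above can be illustrated on toy data: `resample` draws the minority class with replacement until it matches the majority count (the toy frame and counts here are illustrative, not project data):

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced frame: 6 majority (TARGET=0) vs 2 minority (TARGET=1) rows
toy = pd.DataFrame({"x": range(8), "TARGET": [0, 0, 0, 0, 0, 0, 1, 1]})
majority = toy[toy.TARGET == 0]
minority = toy[toy.TARGET == 1]

# Sample the minority with replacement up to the majority size
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)

balanced = pd.concat([majority, minority_up])
print(balanced.TARGET.value_counts())  # both classes now have 6 rows
```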
%%time
np.random.seed(42)
full_pipeline_with_predictor = Pipeline([
("preparation", data_prep_pipeline),
("linear", LogisticRegression())
])
CPU times: user 212 µs, sys: 0 ns, total: 212 µs Wall time: 203 µs
Baseline metrics
classifiers = [
[('Logistic Regression', LogisticRegression(solver='saga',random_state=42),"RFE")],
[('Support Vector', SVC(random_state=42,probability=True),"SVM")],
[('Gradient Boosting', GradientBoostingClassifier(warm_start=True, random_state=42),"RFE")],
[('XGBoost', XGBClassifier(random_state=42),"RFE")],
# [('Light GBM', LGBMClassifier(boosting_type='gbdt', random_state=42),"RFE")],
[('RandomForest', RandomForestClassifier(random_state=42),"RFE")]
]
# Arrange grid search parameters for each classifier
params_grid = {
'Logistic Regression': {
'penalty': ('l1', 'l2','elasticnet'),
'tol': (0.0001, 0.00001),
'C': (10, 1, 0.1, 0.01),
}
,
'Support Vector' : {
'kernel': ('rbf','poly'),
'degree': (4, 5),
'C': ( 0.001, 0.01), # Low C - allow more misclassification (stronger regularization)
'gamma':(0.01,0.1,1) # Low gamma - smoother boundary (higher bias, lower variance)
}
,
'Gradient Boosting': {
'max_depth': [5,10], # Lower helps with overfitting
'max_features': [10,15],
'validation_fraction': [0.2],
'n_iter_no_change': [10],
'tol': [0.01,0.0001],
'n_estimators':[1000],
'subsample' : [0.8], # fraction of observations randomly sampled for each tree
# 'min_samples_split' : [5], # Must have 'x' number of samples to split (Default = 2)
'min_samples_leaf' : [3,5], # (Default = 1) minimum number of samples in a leaf
},
'XGBoost': {
'max_depth': [3,5], # Lower helps with overfitting
'n_estimators':[300,500],
'learning_rate': [0.01,0.1],
# 'objective': ['binary:logistic'],
# 'eval_metric': ['auc'],
'eta' : [0.01,0.1],
'colsample_bytree' : [0.2,0.5],
},
'RandomForest': {
'max_depth': [5,10],
'max_features': [15,20],
'min_samples_split': [5, 10],
'min_samples_leaf': [3, 5],
'bootstrap': [True],
'n_estimators':[1000]},
}
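`GridSearchCV` addresses a pipeline step's hyperparameters with a `<step>__<param>` key, which is why the search loop below prefixes every grid key with `predictor__`. A small runnable sketch of the same idea (toy data; step names are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

pipe = Pipeline([("scale", StandardScaler()),
                 ("predictor", LogisticRegression())])

# Plain grid keys, rewritten with the step prefix as in ConductGridSearch
raw = {"C": (0.1, 1.0)}
params = {"predictor__" + k: v for k, v in raw.items()}

gs = GridSearchCV(pipe, params, cv=3, scoring="roc_auc").fit(X, y)
print(gs.best_params_)  # e.g. {'predictor__C': 0.1}
```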
# Set feature selection settings
# Features removed each step
feature_selection_steps=10
# Number of features used
features_used=len(selected_features)
# Containers for cross-model comparison (assumed initialized here if not defined earlier)
results, scores, accuracy = [], [], []
final_best_clf, features_list = {}, {}
results.append(logit_scores['train_accuracy'])
names = ['Baseline LR']
def ConductGridSearch(in_classifiers,cnfmatrix,fprs,tprs,precisions,recalls):
for (name, classifier,feature_sel) in in_classifiers:
# Print classifier and parameters
print('****** START', name,'*****')
parameters = params_grid[name]
print("Parameters:")
for p in sorted(parameters.keys()):
print("\t"+str(p)+": "+ str(parameters[p]))
# generate the pipeline based on the feature selection method
if feature_sel == "SVM":
full_pipeline_with_predictor = Pipeline([
("preparation", data_prep_pipeline),
# ("PCA",PCA(0.95)),
# ('RFE', RFE(estimator=classifier, n_features_to_select=features_used, step=feature_selection_steps)),
("predictor", classifier)
])
else:
full_pipeline_with_predictor = Pipeline([
("preparation", data_prep_pipeline),
('RFE', RFE(estimator=classifier, n_features_to_select=features_used, step=feature_selection_steps)),
("predictor", classifier)
])
# Execute the grid search
params = {}
for p in parameters.keys():
pipe_key = 'predictor__'+str(p)
params[pipe_key] = parameters[p]
grid_search = GridSearchCV(full_pipeline_with_predictor, params, cv=cvSplits, scoring='roc_auc',
n_jobs=-1,verbose=1)
grid_search.fit(X_train, y_train)
# Best estimator score
best_train = pct(grid_search.best_score_)
# Best train scores
print("Cross validation with best estimator")
best_train_scores = cross_validate(grid_search.best_estimator_, X_train, y_train,cv=cvSplits,scoring=metrics,
return_train_score=True, n_jobs=-1)
#get all scores
best_train_accuracy = np.round(best_train_scores['train_accuracy'].mean(),4)
best_train_f1 = np.round(best_train_scores['train_f1'].mean(),4)
best_train_logloss = np.round(best_train_scores['train_log_loss'].mean(),4)
best_train_roc_auc = np.round(best_train_scores['train_roc_auc'].mean(),4)
valid_time = np.round(best_train_scores['score_time'].mean(),4)
best_valid_accuracy = np.round(best_train_scores['test_accuracy'].mean(),4)
best_valid_f1 = np.round(best_train_scores['test_f1'].mean(),4)
best_valid_logloss = np.round(best_train_scores['test_log_loss'].mean(),4)
best_valid_roc_auc = np.round(best_train_scores['test_roc_auc'].mean(),4)
#append all results
results.append(best_train_scores['train_accuracy'])
names.append(name)
# Conduct t-test with baseline logit (control) and best estimator (experiment)
(t_stat, p_value) = stats.ttest_rel(logit_scores['train_roc_auc'], best_train_scores['train_roc_auc'])
#test and Prediction with whole data
# Best estimator fitting time
print("Fit and Prediction with best estimator")
start = time()
model = grid_search.best_estimator_.fit(X_train, y_train)
train_time = round(time() - start, 4)
# Best estimator prediction time
start = time()
y_test_pred = model.predict(X_test)
test_time = round(time() - start, 4)
scores.append(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
accuracy.append(accuracy_score(y_test, y_test_pred))
# Create confusion matrix for the best model
cnfmatrix = confusion_matrix_def(model,X_train,y_train,X_test,y_test,X_valid, y_valid,cnfmatrix)
# Create AUC ROC curve
fprs,tprs = roc_curve_cust(model,X_train,y_train,X_test, y_test,X_valid, y_valid,fprs,tprs,name)
#Create Precision recall curve
precisions,recalls = precision_recall_cust(model,X_train,y_train,X_test, y_test,X_valid, y_valid,precisions,recalls,name)
#Best Model
final_best_clf[name]=pd.DataFrame([{'label': grid_search.best_estimator_.named_steps['predictor'].__class__.__name__,
'predictor': grid_search.best_estimator_.named_steps['predictor']}])
#Feature importance
feature_name = num_attribs + list(grid_search.best_estimator_.named_steps['preparation'].transformer_list[1][1].named_steps['ohe'].get_feature_names())
feature_list = feature_name
if feature_sel == "RFE":
# features_list[name]=pd.DataFrame({'feature_name': feature_list,
# 'feature_importance': grid_search.best_estimator_.named_steps['PCA'].explained_variance_ratio_})
# 'feature_importance': grid_search.best_estimator_.named_steps['RFE'].ranking_})
# print(len(feature_list),feature_list)
# print(len(grid_search.best_estimator_.named_steps['RFE'].ranking_),
# grid_search.best_estimator_.named_steps['RFE'].ranking_)
features_list[name]=pd.DataFrame({'feature_name': feature_list,
'feature_importance': grid_search.best_estimator_.named_steps['RFE'].ranking_})
# Collect the best parameters found by the grid search
print("Best Parameters:")
best_parameters = grid_search.best_estimator_.get_params()
param_dump = []
for param_name in sorted(params.keys()):
param_dump.append((param_name, best_parameters[param_name]))
print("\t"+str(param_name)+": " + str(best_parameters[param_name]))
print("****** FINISH",name," *****")
print("")
# Record the results
exp_name = name
expLog.loc[len(expLog)] = [f"{exp_name}"] + list(np.round(
[best_train_accuracy,
best_valid_accuracy,
accuracy_score(y_test, y_test_pred),
best_train_roc_auc,
best_valid_roc_auc,
roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
],4))
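Inside `ConductGridSearch`, `stats.ttest_rel` runs a paired t-test on the per-fold train AUC of the baseline logit versus the tuned model: because both models are scored on the same CV splits, each fold yields a matched pair. A quick sketch of what it returns (the fold scores here are made up):

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold train AUC for baseline vs tuned model (same 5 CV splits)
baseline_auc = np.array([0.770, 0.773, 0.771, 0.774, 0.772])
tuned_auc    = np.array([0.801, 0.805, 0.799, 0.804, 0.802])

# Paired test: each fold is measured under both models
t_stat, p_value = stats.ttest_rel(baseline_auc, tuned_auc)
print(t_stat, p_value)  # large |t| and small p -> the folds differ consistently
```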
Best Parameters: predictor__C: 0.1, predictor__penalty: l2, predictor__tol: 1e-05
ConductGridSearch(classifiers[0],cnfmatrix,fprs,tprs,precisions,recalls)
****** START Logistic Regression *****
Parameters:
C: (10, 1, 0.1, 0.01)
penalty: ('l1', 'l2', 'elasticnet')
tol: (0.0001, 1e-05)
Fitting 5 folds for each of 24 candidates, totalling 120 fits
Cross validation with best estimator
Fit and Prediction with best estimator
Best Parameters:
	predictor__C: 0.1
	predictor__penalty: l2
	predictor__tol: 1e-05
****** FINISH Logistic Regression *****
Best Parameters: predictor__max_depth: 10, predictor__max_features: 10, predictor__min_samples_leaf: 5, predictor__n_estimators: 1000, predictor__n_iter_no_change: 10, predictor__subsample: 0.8, predictor__tol: 0.0001, predictor__validation_fraction: 0.2
ConductGridSearch(classifiers[2],cnfmatrix,fprs,tprs,precisions,recalls)
****** START Gradient Boosting *****
Parameters:
	max_depth: [5, 10]
	max_features: [10, 15]
	min_samples_leaf: [3, 5]
	n_estimators: [1000]
	n_iter_no_change: [10]
	subsample: [0.8]
	tol: [0.01, 0.0001]
	validation_fraction: [0.2]
Fitting 5 folds for each of 16 candidates, totalling 80 fits
Cross validation with best estimator
Fit and Prediction with best estimator
Best Parameters:
	predictor__max_depth: 10
	predictor__max_features: 10
	predictor__min_samples_leaf: 5
	predictor__n_estimators: 1000
	predictor__n_iter_no_change: 10
	predictor__subsample: 0.8
	predictor__tol: 0.0001
	predictor__validation_fraction: 0.2
****** FINISH Gradient Boosting *****
Best Parameters: predictor__colsample_bytree: 0.5, predictor__eta: 0.01, predictor__learning_rate: 0.1, predictor__max_depth: 5, predictor__n_estimators: 500
ConductGridSearch(classifiers[3],cnfmatrix,fprs,tprs,precisions,recalls)
****** START XGBoost *****
Parameters:
	colsample_bytree: [0.2, 0.5]
	eta: [0.01, 0.1]
	learning_rate: [0.01, 0.1]
	max_depth: [3, 5]
	n_estimators: [300, 500]
Fitting 5 folds for each of 32 candidates, totalling 160 fits
Cross validation with best estimator
Fit and Prediction with best estimator
Best Parameters:
	predictor__colsample_bytree: 0.5
	predictor__eta: 0.01
	predictor__learning_rate: 0.1
	predictor__max_depth: 5
	predictor__n_estimators: 500
****** FINISH XGBoost *****
Best Parameters: predictor__C: 0.01, predictor__degree: 4, predictor__gamma: 1, predictor__kernel: rbf
ConductGridSearch(classifiers[1],cnfmatrix,fprs,tprs,precisions,recalls)
****** START Support Vector *****
Parameters:
C: (0.001, 0.01)
degree: (4, 5)
gamma: (0.01, 0.1, 1)
kernel: ('rbf', 'poly')
Fitting 5 folds for each of 24 candidates, totalling 120 fits
Cross validation with best estimator
Fit and Prediction with best estimator
Best Parameters:
	predictor__C: 0.01
	predictor__degree: 4
	predictor__gamma: 1
	predictor__kernel: rbf
****** FINISH Support Vector *****
Best Parameters: predictor__C: 0.1, predictor__penalty: l2, predictor__tol: 0.0001
for (name, classifier,feature_sel) in classifiers[0]:
# Print classifier and parameters
print('****** START', name,'*****')
parameters = params_grid[name]
print("Parameters:")
for p in sorted(parameters.keys()):
print("\t"+str(p)+": "+ str(parameters[p]))
# generate the pipeline based on the feature selection method
full_pipeline_with_predictor = Pipeline([
("preparation", data_prep_pipeline),
("PCA",PCA(0.95)),
("predictor", classifier)
])
# Execute the grid search
params = {}
for p in parameters.keys():
pipe_key = 'predictor__'+str(p)
params[pipe_key] = parameters[p]
grid_search = GridSearchCV(full_pipeline_with_predictor, params, cv=cvSplits, scoring='roc_auc',
n_jobs=-1,verbose=1)
grid_search.fit(X_train, y_train)
# Best estimator score
best_train = pct(grid_search.best_score_)
# Best train scores
print("Cross validation with best estimator")
best_train_scores = cross_validate(grid_search.best_estimator_, X_train, y_train,cv=cvSplits,scoring=metrics,
return_train_score=True, n_jobs=-1)
#get all scores
best_train_accuracy = np.round(best_train_scores['train_accuracy'].mean(),4)
best_train_f1 = np.round(best_train_scores['train_f1'].mean(),4)
best_train_logloss = np.round(best_train_scores['train_log_loss'].mean(),4)
best_train_roc_auc = np.round(best_train_scores['train_roc_auc'].mean(),4)
valid_time = np.round(best_train_scores['score_time'].mean(),4)
best_valid_accuracy = np.round(best_train_scores['test_accuracy'].mean(),4)
best_valid_f1 = np.round(best_train_scores['test_f1'].mean(),4)
best_valid_logloss = np.round(best_train_scores['test_log_loss'].mean(),4)
best_valid_roc_auc = np.round(best_train_scores['test_roc_auc'].mean(),4)
(t_stat, p_value) = stats.ttest_rel(logit_scores['train_roc_auc'], best_train_scores['train_roc_auc'])
#test and Prediction with whole data
# Best estimator fitting time
print("Fit and Prediction with best estimator")
start = time()
model = grid_search.best_estimator_.fit(X_train, y_train)
train_time = round(time() - start, 4)
# Best estimator prediction time
start = time()
y_test_pred = model.predict(X_test)
test_time = round(time() - start, 4)
# Collect the best parameters found by the grid search
print("Best Parameters:")
best_parameters = grid_search.best_estimator_.get_params()
param_dump = []
for param_name in sorted(params.keys()):
param_dump.append((param_name, best_parameters[param_name]))
print("\t"+str(param_name)+": " + str(best_parameters[param_name]))
print("****** FINISH",name," *****")
print("")
# Record the results
exp_name = "Logistic Regression with PCA"
expLog.loc[len(expLog)] = [f"{exp_name}"] + list(np.round(
[best_train_accuracy,
best_valid_accuracy,
accuracy_score(y_test, y_test_pred),
best_train_roc_auc,
best_valid_roc_auc,
roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
],4))
****** START Logistic Regression *****
Parameters:
C: (10, 1, 0.1, 0.01)
penalty: ('l1', 'l2', 'elasticnet')
tol: (0.0001, 1e-05)
Fitting 5 folds for each of 24 candidates, totalling 120 fits
Cross validation with best estimator
Fit and Prediction with best estimator
Best Parameters:
predictor__C: 0.1
predictor__penalty: l2
predictor__tol: 0.0001
****** FINISH Logistic Regression *****
# plot feature importance by their ranking for each model
for name in names[1:-1]:
plt.figure(figsize=(10,10), dpi= 80)
features_df = features_list[name].sort_values(['feature_importance','feature_name'], ascending=[False, False])
sortedNames = np.array(features_df)[0:25, 0]
sortedImportances = np.array(features_df)[0:25, 1]
plt.title('Feature Importance - ' + name)
plt.barh(range(len(sortedNames)), sortedImportances, color='g', align='center')
plt.yticks(range(len(sortedNames)), sortedNames)
plt.xlabel('RFE ranking (1 = most important)')
plt.grid()
plt.show()
# boxplot algorithm comparison
fig = pyplot.figure()
fig.suptitle('Classification Algorithm Comparison')
ax = fig.add_subplot(111)
pyplot.boxplot(results)
ax.set_xticklabels(names,rotation=90)
pyplot.grid()
pyplot.show()
# roc curve fpr, tpr for all classifiers
plt.plot([0,1],[0,1], 'k--')
for i in range(len(names)-1):
plt.plot(fprs[i],tprs[i],label = names[i] + ' ' + str(scores[i]))
plt.legend(bbox_to_anchor=(1.04,1), loc="upper left", borderaxespad=0)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title('Receiver Operating Characteristic')
plt.show()
# precision recall curve for all classifiers
for i in range(len(names)-1):
plt.plot(recalls[i],precisions[i],label = names[i])
plt.legend(bbox_to_anchor=(1.04,1), loc="upper left", borderaxespad=0)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title('Precision-Recall Curve')
plt.show()
# plot confusion matrix for all classifiers
f, axes = plt.subplots(1, len(names), figsize=(30, 8), sharey='row')
for i in range(len(names)):
disp = ConfusionMatrixDisplay(cnfmatrix[i], display_labels=['0', '1'])
disp.plot(ax=axes[i], xticks_rotation=0)
disp.ax_.set_title("Confusion Matrix - " + names[i])
disp.im_.colorbar.remove()
disp.ax_.set_xlabel('')
if i!=0:
disp.ax_.set_ylabel('')
f.text(0.4, 0.1, 'Predicted label', ha='left')
plt.subplots_adjust(wspace=0.10, hspace=0.1)
f.colorbar(disp.im_, ax=axes)
plt.show()
pd.set_option('display.max_colwidth', None)
expLog
|   | exp_name | Train Acc | Valid Acc | Test Acc | Train AUC | Valid AUC | Test AUC |
|---|---|---|---|---|---|---|---|
| 0 | Baseline_14_features | 0.9210 | 0.9195 | 0.9209 | 0.7725 | 0.8078 | 0.6971 |
| 1 | Logistic Regression | 0.6898 | 0.6901 | 0.6804 | 0.7587 | 0.7590 | 0.6867 |
| 2 | XGBoost | 0.9552 | 0.9224 | 0.8342 | 0.9854 | 0.9654 | 0.6363 |
| 3 | Gradient Boosting | 0.9964 | 0.9840 | 0.9133 | 0.9996 | 0.9988 | 0.6604 |
| 4 | Support Vector | 0.5018 | 0.4958 | 0.9014 | 0.9914 | 0.9874 | 0.5154 |
| 5 | Logistic Regression | 0.6947 | 0.7000 | 0.6723 | 0.7711 | 0.7708 | 0.7118 |
| 6 | Logistic Regression with PCA | 0.6947 | 0.7000 | 0.6723 | 0.7711 | 0.7708 | 0.7118 |
final_best_clf['Logistic Regression']['predictor'][0]
LogisticRegression(C=0.1, random_state=42, solver='saga', tol=1e-05)
%%time
np.random.seed(42)
model_selection = ['Logistic Regression','Gradient Boosting','XGBoost']
print("Classifier with parameters")
final_estimators = []
for i,clf in enumerate(model_selection):
model = final_best_clf[clf]['predictor'][0]
print(i+1, " :",model)
final_estimators.append((clf,make_pipeline(data_prep_pipeline,
RFE(estimator=model,n_features_to_select=features_used, step=feature_selection_steps),
model)))
Classifier with parameters
1 : LogisticRegression(C=0.1, random_state=42, solver='saga', tol=1e-05)
2 : GradientBoostingClassifier(max_depth=10, max_features=10, min_samples_leaf=3,
n_estimators=1000, n_iter_no_change=10,
random_state=42, subsample=0.8,
validation_fraction=0.2, warm_start=True)
3 : XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
colsample_bylevel=1, colsample_bynode=1, colsample_bytree=0.5,
early_stopping_rounds=None, enable_categorical=False, eta=0.01,
eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
importance_type=None, interaction_constraints='',
learning_rate=0.1, max_bin=256, max_cat_to_onehot=4,
max_delta_step=0, max_depth=5, max_leaves=0, min_child_weight=1,
missing=nan, monotone_constraints='()', n_estimators=500,
n_jobs=0, num_parallel_tree=1, predictor='auto', random_state=42,
reg_alpha=0, ...)
CPU times: user 4.77 ms, sys: 3.21 ms, total: 7.98 ms
Wall time: 6.08 ms
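The `voting_classifier` used in the submission cell below is not constructed in this excerpt; presumably it is a soft-voting ensemble over `final_estimators`. A minimal self-contained sketch of that pattern (toy data and member models are illustrative; the real version would pass the fitted pipelines built above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Soft voting averages predict_proba across the member models
voting_classifier = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("tree", DecisionTreeClassifier(max_depth=3, random_state=0))],
    voting="soft")
voting_classifier.fit(X, y)

proba = voting_classifier.predict_proba(X)[:, 1]  # P(TARGET=1), as Kaggle expects
```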
For each SK_ID_CURR in the test set, you must predict a probability for the TARGET variable. The file should contain a header and have the following format:
SK_ID_CURR,TARGET
100001,0.1
100005,0.9
100013,0.2
etc.
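The required format maps directly to a two-column DataFrame written without the index; a minimal sketch using the sample IDs from the format example above:

```python
import pandas as pd

# Sample IDs and predicted probabilities from the format example above
submit = pd.DataFrame({"SK_ID_CURR": [100001, 100005, 100013],
                       "TARGET": [0.1, 0.9, 0.2]})

# index=False keeps the file to exactly the two required columns
csv_text = submit.to_csv(index=False)
print(csv_text.splitlines()[0])  # SK_ID_CURR,TARGET
```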
test_class_scores = voting_classifier.predict_proba(X_kaggle_test)[:, 1]
test_class_scores[0:10]
submit_df = datasets["application_test"][['SK_ID_CURR']]
submit_df['TARGET'] = test_class_scores
submit_df.head()
submit_df.to_csv("submission.csv",index=False)
model = final_best_clf[model_selection[2]]['predictor'][0]
XG_Pipeline = Pipeline([
("preparation", data_prep_pipeline),
('RFE', RFE(estimator=model,n_features_to_select=features_used, step=feature_selection_steps)),
('XGB', model)])
XG_Pipeline.fit(final_X_train, final_y_train)
class_scores = XG_Pipeline.predict_proba(X_kaggle_test)[:, 1]
# Submission dataframe
submit_df_1 = datasets["application_test"][['SK_ID_CURR']]
submit_df_1['TARGET'] = class_scores
submit_df_1.to_csv("submission1.csv",index=False)
model = final_best_clf[model_selection[0]]['predictor'][0]
LR_Pipeline = Pipeline([
("preparation", data_prep_pipeline),
('RFE', RFE(estimator=model,n_features_to_select=features_used, step=feature_selection_steps)),
('LR', model)])
LR_Pipeline.fit(final_X_train, final_y_train)
class_scores = LR_Pipeline.predict_proba(X_kaggle_test)[:, 1]
# Submission dataframe
submit_df_2 = datasets["application_test"][['SK_ID_CURR']]
submit_df_2['TARGET'] = class_scores
submit_df_2.to_csv("submission2.csv",index=False)
! kaggle competitions submit -c home-credit-default-risk -f submission.csv -m "baseline submission - phase-2"
!kaggle competitions submit -c home-credit-default-risk -f submission2.csv -m "Logistic Regression submission"
import torch
import torchvision
import torch.utils.data
import torchvision.transforms as transforms
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error
import warnings
warnings.filterwarnings("ignore")
# Is there a GPU available? If so, use it
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# Assuming that we are on a CUDA machine, this should print a CUDA device:
print(device)
cuda:0
import random
# Set seeds
torch.manual_seed(42)
random.seed(42)
np.random.seed(42)
import torch.nn as nn
import torch.nn.functional as F
from torchsummary import summary
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set as an environment variable; a bare Python assignment has no effect
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)
cpu
!pip install pytorch-ignite
Collecting pytorch-ignite
Downloading pytorch_ignite-0.4.8-py3-none-any.whl (251 kB)
|████████████████████████████████| 251 kB 6.6 MB/s eta 0:00:01
Requirement already satisfied: torch<2,>=1.3 in /usr/local/lib/python3.7/dist-packages (from pytorch-ignite) (1.11.0+cu113)
Requirement already satisfied: typing-extensions in /usr/local/lib/python3.7/dist-packages (from torch<2,>=1.3->pytorch-ignite) (4.2.0)
Installing collected packages: pytorch-ignite
Successfully installed pytorch-ignite-0.4.8
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
from torchvision import datasets, transforms
from torch.autograd import Variable
import numpy as np
indices = np.arange(50000)
np.random.shuffle(indices)
# MNIST has separate training and test sets that are downloaded as separate files
train_dataset =datasets.MNIST('../mnist_data', download=True, train=True,
transform=transforms.Compose([transforms.ToTensor(), # first, convert image to PyTorch tensor
transforms.Normalize((0.1307,), (0.3081,)) # normalize inputs
]))
train_set, val_set = torch.utils.data.random_split(train_dataset, [50000, 10000])
# download and transform train dataset
train_loader = torch.utils.data.DataLoader(train_set,
batch_size=16,
shuffle=False, #can not shuffle and sample at the same time!
sampler=torch.utils.data.SubsetRandomSampler(indices[:10_000]))
# set up validation dataset loader
valid_loader = torch.utils.data.DataLoader(val_set,
batch_size=16,
shuffle=False, #can not shuffle and sample at the same time!
)
# download and transform test dataset
test_loader = torch.utils.data.DataLoader(datasets.MNIST('../mnist_data',
download=True,
train=False,
transform=transforms.Compose([
transforms.ToTensor(), # first, convert image to PyTorch tensor
transforms.Normalize((0.1307,), (0.3081,)) # normalize inputs
])),
batch_size=16,
shuffle=True)
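`Normalize((0.1307,), (0.3081,))` standardizes each pixel as `(x - mean) / std`, using MNIST's global pixel mean and standard deviation after `ToTensor` scales values to [0, 1]. A quick numpy check of that arithmetic:

```python
import numpy as np

mean, std = 0.1307, 0.3081  # MNIST global pixel stats after ToTensor scaling

pixels = np.array([0.0, 0.1307, 1.0])
normalized = (pixels - mean) / std

print(normalized)  # a pixel equal to the mean maps to exactly 0
```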
# Inspecting the dataloader
print(test_loader.batch_size)
print(train_loader.sampler)
dataiter = iter(train_loader)
inputs, target = next(dataiter)  # the iterator's .next() method was removed in newer PyTorch
len(inputs)
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to ../mnist_data/MNIST/raw/train-images-idx3-ubyte.gz
Extracting ../mnist_data/MNIST/raw/train-images-idx3-ubyte.gz to ../mnist_data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to ../mnist_data/MNIST/raw/train-labels-idx1-ubyte.gz
Extracting ../mnist_data/MNIST/raw/train-labels-idx1-ubyte.gz to ../mnist_data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to ../mnist_data/MNIST/raw/t10k-images-idx3-ubyte.gz
Extracting ../mnist_data/MNIST/raw/t10k-images-idx3-ubyte.gz to ../mnist_data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to ../mnist_data/MNIST/raw/t10k-labels-idx1-ubyte.gz
Extracting ../mnist_data/MNIST/raw/t10k-labels-idx1-ubyte.gz to ../mnist_data/MNIST/raw
16
<torch.utils.data.sampler.SubsetRandomSampler object at 0x7f1561ed3c50>
16
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
# functions to show an image
def imshow(img):
img = img * 0.3081 + 0.1307  # unnormalize: invert Normalize((0.1307,), (0.3081,))
npimg = img.numpy()
plt.imshow(np.transpose(npimg, (1, 2, 0)))
plt.show()
# get some random training images
dataiter = iter(train_loader)
images, labels = next(dataiter)
# show images
imshow(torchvision.utils.make_grid(images))
# print labels
print(' '.join('%5s' % labels[j].numpy() for j in range(len(labels))))
Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).
1 1 1 1 0 0 0 1 0 0 0 0 0 1 0 1 1 1 0 1 1 1 1 1 0 1 0 0 1 1 0 1 0 0 0 0 1 0 0 0 1 1 1 1 1 0 0 0 1 1 0 0 0 1 0 1 1 1 1 1 0 0 0 1 0 1 1 0 1 0 1 1 0 1 1 0 1 1 1 0 0 1 1 1 0 0 1 1 1 0 0 0 0 1 0 0 1 1 0 0 1 1 0 0 1 1 1 1 1 0 1 0 0 0 0 1 0 0 1 0 1 1 0 1 0 1 0 0 1 0 1 1 1 1 0 0 0 1 1 1 0 1 1 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 1 0 1 0 0 1 1 0 1 0 0 1 0 0 1 1 0 0 1 1 0 1 1 0 0 0 0 1 1 1 1 0 1 1 0 1 0 1 0 1 1 1 0 0 1 0 0 0 1 0 0 1 1 0 0 1 1 0 1 1 1 0 1 1 1 0 0 0 1 0 1 1 1 1 1 0 1 0 1 1 0 1 0 1 1 1 1 0 0 0 0 1 0 0 1 0 1 1 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 0 0 1 1 1 1 0 1 1 1 0 1 0 1 0 0 1 0 0 0 0 0 1 1 0 0 0 1 0 1 0 1 1 1 1 1 1 0 1 1 0 0 1 0 1 0 1 1 0 1 1 0 0 1 1 0 1 1 1 0 0 0 0 0 1 0 0 1 0 1 1 0 1 0 1 0 0 0 0 0 0 0 1 1 1 1 1 1 1 0 1 0 1 0 1 1 1 1 0 0 1 1 1 1 1 0 0 0 1 0 0 1 0 1 1 1 1 0 0 0 1 1 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1 0 1 1 1 1 1 1 0 0 1 0 1 1 0 1 1 1 0 0 0 1 1 0 1 1 1 1 1 0 0 1 0 1 0 1 0 1 1 0 1 0 1 0 0 1 1 0 0 1 1 0 1 0 1 1 1 0 1 1 0 0 0 1 1 0 0 1 1 1 0 1 1 0 1 1 1 1 0 1 0 1 0 0 0 1 1 1 0 0 1 0 1 1 1 0 0 1 0 0 0 1 0 1 1 0 0 0 1 1 1 0 1 1 0 0 1 0 1 1 1 1 0 1 1 1 0 0 0 0 0 0 0 1 1 0 1 1 0 1 1 1 1 1 0 1 1 0 1 1 1 0 1 0 0 0 1 0 1 0 1 1 1 0 1 0 0 1 1 0 1 1 0 1 0 1 1 1 1 0 0 0 0 0 1 0 0 1 0 0 0 1 1 0 0 0 1 0 1 0 1 0 1 1 0 1 1 0 1 0 1 0 0 1 1 0 0 0 1 0 0 1 0 1 1 0 1 0 1 1 1 1 1 0 0 1 1 0 1 0 1 1 1 0 0 0 1 1 0 1 0 1 1 0 0 1 0 0 1 1 0 1 0 1 0 1 0 1 0 0 0 1 1 1 0 1 1 0 1 1 1 1 1 0 1 1 0 0 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 0 1 1 1 1 0 0 1 0 0 1 0 1 0 0 1 0 1 0 1 1 1 0 0 0 1 1 1 0 1 1 1 0 0 1 0 0 1 0 1 1 1 0 1 1 1 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 1 1 1 1 1 0 0 0 1 0 1 0 0 0 0 0 1 0 1 0 1 1 0 1 1 1 1 0 1 0 0 0 1 1 1 1 0 0 1 1 1 0 1 0 0 0 0 1 1 0 0 1 1 1 1 0 0 0 0 1 1 1 1 0 1 1 0 1 1 1 1 1 0 0 0 1 0 0 1 1 1 1 1 0 1 0 0 0 1 0 1 0 1 1 1 0 1 1 1 0 0 1 0 1 1 0 1 0 0 1 1 0 0 1 1 1 0 0 0 0 1 0 1 0 0 1 0 1 1 1 0 0 0 1 0 1 1 1 0 1 1 1 0 1 1 0 0 1 0 0 0 0 0 1 1 1 0 1 0 0 1 0 1 1 0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 1 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 
1 0 1 1 1 0 0 0 0 1 1 0 1 0 1 1 0 0 1 1 1 1 0 1 1 0 1 0 0 1 1 0 0 0 1 0 0 0 0 1 1 0 1 0 1 1 0 0 1 0 0 1 1 0 0 1 1 0 0 1 1 0 1 1 1 0 1 0 0 0 1 1 0 1 0 1 0 0 1 1 0 0 0 1 1 1 1 1 0 0 0 1 1 1 1 0 1 1 1 0 0 1 1 0 1 1 0 0 0 0 0 0 1 0 1 0 0 0 1 1 0 0 1 1 1 0 1 0 1 1 0 0 0 1 1 0 0 0 1 1 1 0 0 0 1 0 1 1 1 1 0 0 1 0 1 1 0 1 0 0 1 0 1 0 0 1 1 1 0 0 1 0 0 1 1 0 0 0 0 1 0 0 0 1 1 1 0 0 1 1 1 1 0 1 1 1 0 1 0 0 0 0 1 0 1 1 1 1 1 0 0 1 1 0 0 0 1 1 1 0 0 1 1 1 1 1 1 1 1 0 0 1 1 1 0 1 0 1 0 1 1 1 1 0 0 1 0 1 0 1 0 1 0 0 1 1 1 0 1 1 1 1 1 1 0 0 0 1 1 0 1 1 1 0 0 0 1 1 0 0 1 1 1 1 0 1 1 0 1 1 0 0 1 1 0 1 1 0 0 1 0 1 1 1 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 1 1 0 0 0 0 1 0 1 1 1 0 0 0 1 0 0 0 0 0 1 1 0 0 1 0 1 1 1 0 1 1 0 1 0 0 1 0 0 0 0 1 1 0 0 1 1 1 0 0 1 0 1 0 0 0 1 0 1 1 1 0 1 1 1 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 1 1 0 0 0 1 1 1 0 1 1 0 0 1 0 1 1 1 1 0 1 0 0 1 0 1 1 0 1 1 0 0 0 1 0 1 0 1 0 1 0 1 1 0 0 1 0 0 1 0 0 1 0 1 1 0 1 1 0 0 0 1 1 0 0 1 1 0 1 1 1 1 0 0 1 0 0 0 1 1 0 1 1 0 0 1 0 1 0 1 1 0 1 1 0 1 1 1 0 0 0 0 0 1 1 1 0 1 0 0 1 1 0 1 0 0 1 0 1 0 0 0 0 1 1 1 0 1 1 0 1 0 0 1 1 0 0 1 0 1 1 0 1 1 0 0 1 1 0 0 0 0 0 0 1 1 0 0 1 1 1 1 0 1 0 1 0 1 0 1 1 0 0 1 1 0 0 0 1 1 1 1 0 1 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 1 1 1 1 0 0 0 1 1 0 0 1 1 0 1 0 0 1 0 0 0 1 1 1 1 1 1 1 0 1 0 1 1 1 1 0 0 1 0 0 1 0 0 0 1 1 1 0 1 0 1 1 1 1 0 0 1 1 1 0 0 1 1 1 0 0 1 1 0 1 0 0 1 1 1 0 1 0 1 1 0 0 0 0 1 1 0 0 0 1 0 1 0 1 1 0 0 1 1 1 0 0 1 1 0 1 0 0 1 1 1 1 1 1 1 0 0 1 1 1 1 0 0 1 0 0 0 1 0 0 1 1 0 1 1 0 1 0 1 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0 1 1 1 1 1 0 1 1 1 1 0 0 1 0 1 1 0 1 0 0 0 0 1 0 0 0 1 0 0 1 1 1 0 0 1 0 0 0 1 0 0 1 0 0 1 0 1 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 1 1 1 1 1 0 1 1 0 0 1 1 0 0 1 1 1 0 1 1 0 1 0 0 1 0 0 0 0 0 1 0 1 1 1 1 1 1 0 1 0 0 1 1 0 1 0 0 0 0 1 1 1 1 1 1 0 1 0 1 0 0 1 1 0 1 0 1 0 1 0 1 0 1 0 0 1 1 1 0 0 0 1 0 0 0 1 0 1 0 1 0 1 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 1 1 1 0 0 1 1 0 0 1 0 1 0 1 0 1 1 0 1 0 1 0 1 0 0 0 0 1 0 0 0 0 1 1 0 1 1 1 1 0 0 1 1 1 1 0 1 1 0 0 1 1 1 1 1 0 0 1 1 1 1 
0 0 1 1 0 0 0 0 0 1 1 0 0 0 1 1 0 0 0 1 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 1 1 1 0 0 1 0 1 0 0 0 1 1 0 0 1 0 1 0 0 0 0 1 0 1 1 0 0 1 0 1 1 0 0 0 0 1 0 1 1 1 0 1 1 1 1 1 1 0 1 1 0 1 1 0 0 1 1 1 1 1 0 1 1 1 1 0 0 0 1 0 0 1 0 0 1 1 1 1 0 1 0 0 0 0 0 1 1 0 0 1 1 1 1 0 1 0 1 0 1 1 0 1 0 0 1 0 1 0 0 0 0 0 0 0 1 0 1 0 0 1 1 1 1 1 1 0 0 1 0 0 0 0 0 0 0 1 0 1 0 1 1 1 0 0 1 1 1 1 1 0 1 1 1 0 0 0 0 1 0 0 1 1 1 0 0 0 1 1 0 1 0 0 1 1 0 1 0 1 1 1 0 1 1 1 1 1 1 0 1 0 0 1 1 0 1 1 0 0 1 0 1 1 0 1 1 1 0 1 0 1 0 0 1 0 1 1 0 1 1 0 1 1 0 1 1 1 1 0 0 0 1 0 0 0 0 1 0 1 0 0 1 0 1 1 1 0 1 0 1 1 1 1 0 1 1 1 0 1 0 1 0 1 0 1 1 0 0 1 0 1 0 0 1 0 0 0 1 0 1 1 0 0 0 1 0 0 1 1 0 0 1 1 1 1 0 1 1 1 1 1 0 1 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 1 0 1 1 1 1 1 1 0 0 0 1 0 1 0 1 1 1 1 0 0 1 0 1 0 0 0 0 1 0 1 1 1 1 1 1 0 0 0 0 1 0 1 1 0 0 0 1 0 0 1 1 1 1 0 0 1 1 1 1 0 1 1 1 0 0 1 0 0 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 0 1 0 1 1 1 0 1 1 1 1 1 0 0 0 0 0 1 1 1 0 0 0 0 1 0 1 0 0 1 0 0 1 0 1 0 0 1 1 0 0 1 1 0 1 0 0 0 1 0 0 0 1 0 1 0 1 1 0 0 1 1 1 0 1 0 1 1 0 0 1 0 1 0 0 0 1 0 1 1 0 1 0 1 0 1 1 0 1 1 1 0 0 1 0 1 1 0 1 1 0 0 0 0 0 0 1 1 0 0 1 1 0 0 0 0 0 1 1 1 1 1 0 0 0 1 0 0 0 0 1 0 0 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 0 1 0 0 0 0 1 0 0 1 0 1 0 0 0 1 1 0 0 1 1 1 1 0 1 0 0 1 0 0 0 1 0 1 1 1 1 1 0 0 0 0 1 0 1 0 1 1 1 1 1 1 1 1 0 0 1 0 0 1 1 1 0 0 0 0 1 1 1 0 1 1 1 1 0 1 1 0 0 0 1 0 1 1 0 0 0 1 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 0 0 1 1 1 1 1 1 1 0 1 0 1 1 1 0 0 0 1 0 1 0 1 0 1 1 1 0 1 0 0 0 1
from torch.utils.tensorboard import SummaryWriter
# Importing necessary libraries again, including PyTorch
# Setting GPU usage
# Standard libraries
import os
import math
import numpy as np
import time
import pandas as pd
import zipfile
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline, FeatureUnion
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix
import warnings
warnings.filterwarnings('ignore')
# Progress bar
from tqdm.notebook import tqdm
# Import PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F
print("Using torch", torch.__version__)
# Plotting (matplotlib and seaborn were already imported above)
%matplotlib inline
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# Assuming that we are on a CUDA machine, this should print a CUDA device:
print(f"We are working on a {device} device")
We are working on a cpu device
X_train_tfm = data_prep_pipeline.fit_transform(X_train, y_train)
# transform only: the pipeline is fitted on the training data to avoid leakage
X_valid_tfm = data_prep_pipeline.transform(X_valid)
X_test_tfm = data_prep_pipeline.transform(X_test)
X_kaggle_test_tfm = data_prep_pipeline.transform(X_kaggle_test)
from torch.utils.data import Dataset, DataLoader
from ignite.engine import create_supervised_trainer, create_supervised_evaluator, Events
from ignite.metrics import Precision, Recall, Accuracy, Loss
class HCDR_Trainset(Dataset):
    def __init__(self):
        # Read in the transformed training data and convert to tensors
        X_train_tf = torch.FloatTensor(X_train_tfm)
        y_train_tf = torch.LongTensor(y_train.values)
        self.n_samples = X_train_tf.shape[0]
        self.x_data = X_train_tf  # size [n_samples, n_features]
        self.y_data = y_train_tf  # size [n_samples]

    # support indexing so that dataset[i] returns the i-th sample
    def __getitem__(self, index):
        return self.x_data[index], self.y_data[index]

    # len(dataset) returns the number of samples
    def __len__(self):
        return self.n_samples
class HCDR_Testset(Dataset):
    def __init__(self):
        # Read in the transformed test data and convert to tensors
        X_test_tf = torch.FloatTensor(X_test_tfm)
        y_test_tf = torch.LongTensor(y_test.values)
        self.n_samples = X_test_tf.shape[0]
        self.x_data = X_test_tf  # size [n_samples, n_features]
        self.y_data = y_test_tf  # size [n_samples]

    # support indexing so that dataset[i] returns the i-th sample
    def __getitem__(self, index):
        return self.x_data[index], self.y_data[index]

    # len(dataset) returns the number of samples
    def __len__(self):
        return self.n_samples
trainset = HCDR_Trainset()
train_loader = DataLoader(trainset, batch_size=10000, shuffle=True)
testset = HCDR_Testset()
test_loader = DataLoader(testset, batch_size=1000, shuffle=False)
from torch.utils.data import DataLoader, TensorDataset
from torch import Tensor
kaggle_test_dataset = TensorDataset(Tensor(np.array(X_kaggle_test_tfm)))
kaggle_test_loader = DataLoader(kaggle_test_dataset, shuffle=False, batch_size=1000)
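Scoring the Kaggle test set then follows the standard inference loop. The sketch below is a hedged illustration with toy stand-ins (random features and an untrained linear model) in place of the real transformed features and trained model:

```python
import numpy as np
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins: 50 rows of 43 features and an untrained linear model;
# in the real pipeline these would be X_kaggle_test_tfm and a trained model
demo_features = np.random.rand(50, 43).astype(np.float32)
demo_model = torch.nn.Linear(43, 2)

demo_loader = DataLoader(TensorDataset(torch.from_numpy(demo_features)),
                         batch_size=16, shuffle=False)

demo_probs = []
demo_model.eval()
with torch.no_grad():  # no gradients needed at inference time
    for (xb,) in demo_loader:
        logits = demo_model(xb)
        demo_probs.append(F.softmax(logits, dim=1)[:, 1])  # P(default)
demo_probs = torch.cat(demo_probs).numpy()
print(demo_probs.shape)
```

The class-1 probabilities collected this way are what a submission CSV would be built from.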
x,y = next(iter(train_loader))
x.shape
#y.shape
torch.Size([6748, 44])
x_test, y_test = next(iter(test_loader))
x_test.shape
torch.Size([1000, 43])
y_test.dtype
torch.int64
mlp_log = pd.DataFrame(columns=["exp_name", "Dataset", "CXE Loss",
                                "Accuracy", "ROC AUC Score"])
# Creating a very simple single layer model using sequential API
input_features = 43
out_features= 2
model_basic = torch.nn.Sequential(
    torch.nn.Linear(input_features, out_features),
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model_basic.parameters(), lr=0.001)
writer = SummaryWriter()
for epoch in range(200):
    # Forward pass (note: this loop trains on the single batch
    # (x_test, y_test) drawn from test_loader above)
    y_pred = model_basic(x_test)
    # Compute and print the loss
    loss = loss_fn(y_pred, y_test)
    if epoch % 10 == 0:
        print(f"Epoch:{epoch}, CXE Loss: {loss.item():.9}")
    writer.add_scalar("Loss/train", loss, epoch)
    optimizer.zero_grad()
    # Backward pass: compute gradients of the loss w.r.t. model parameters
    loss.backward()
    # The optimizer step updates the parameters
    optimizer.step()
y_pred_class = torch.argmax(y_pred, dim=1)
probs = F.softmax(y_pred, dim=1)
y_probs = probs.detach().numpy()[:, 1]
score = accuracy_score(y_test, y_pred_class)
auc_score = roc_auc_score(y_test, y_probs)
mlp_log.loc[len(mlp_log)] = ["NN Model without hidden layer", "Test", loss.item(), score, auc_score]
print(model_basic)
Epoch:0, CXE Loss: 0.550622046 Epoch:10, CXE Loss: 0.498641461 Epoch:20, CXE Loss: 0.45413202 Epoch:30, CXE Loss: 0.416984469 Epoch:40, CXE Loss: 0.386560529 Epoch:50, CXE Loss: 0.36190334 Epoch:60, CXE Loss: 0.342009395 Epoch:70, CXE Loss: 0.325967491 Epoch:80, CXE Loss: 0.313004822 Epoch:90, CXE Loss: 0.302490175 Epoch:100, CXE Loss: 0.293919027 Epoch:110, CXE Loss: 0.286893249 Epoch:120, CXE Loss: 0.281100184 Epoch:130, CXE Loss: 0.276294529 Epoch:140, CXE Loss: 0.272283465 Epoch:150, CXE Loss: 0.268915057 Epoch:160, CXE Loss: 0.266068965 Epoch:170, CXE Loss: 0.263649523 Epoch:180, CXE Loss: 0.261580467 Epoch:190, CXE Loss: 0.259800434 Sequential( (0): Linear(in_features=43, out_features=2, bias=True) )
%load_ext tensorboard
%tensorboard --logdir=runs
Reusing TensorBoard on port 6006 (pid 979), started 0:00:49 ago. (Use '!kill 979' to kill it.)
print(score)
0.922
# Deeper network: two linear layers (one hidden layer) with a non-linear activation
input_features = 43
hidden = 20
out_features= 2
model_2NN = torch.nn.Sequential(
    nn.Linear(input_features, hidden),
    nn.ReLU(),  # non-linear activation function
    nn.Linear(hidden, out_features),
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model_2NN.parameters(), lr=0.001)
writer = SummaryWriter()
for epoch in range(200):
    # Forward pass (note: this loop also trains on the single batch
    # (x_test, y_test) drawn from test_loader above)
    y_pred = model_2NN(x_test)
    # Compute and print the loss
    loss = loss_fn(y_pred, y_test)
    if epoch % 10 == 0:
        print(f"Epoch:{epoch}, CXE Loss: {loss.item():.9}")
    writer.add_scalar("Loss/train", loss, epoch)
    optimizer.zero_grad()
    # Backward pass: compute gradients of the loss w.r.t. model parameters
    loss.backward()
    # The optimizer step updates the parameters
    optimizer.step()
y_pred_class = torch.argmax(y_pred, dim=1)
probs = F.softmax(y_pred, dim=1)
y_probs = probs.detach().numpy()[:, 1]
score = accuracy_score(y_test, y_pred_class)
auc_score = roc_auc_score(y_test, y_probs)
mlp_log.loc[len(mlp_log)] = ["MLP with 1 hidden layer", "Test", loss.item(), score, auc_score]
print(model_2NN)
Epoch:0, CXE Loss: 0.733312666 Epoch:10, CXE Loss: 0.653566539 Epoch:20, CXE Loss: 0.577171385 Epoch:30, CXE Loss: 0.502379894 Epoch:40, CXE Loss: 0.430886418 Epoch:50, CXE Loss: 0.368706554 Epoch:60, CXE Loss: 0.321524501 Epoch:70, CXE Loss: 0.290671974 Epoch:80, CXE Loss: 0.272667259 Epoch:90, CXE Loss: 0.262763023 Epoch:100, CXE Loss: 0.2572923 Epoch:110, CXE Loss: 0.253934562 Epoch:120, CXE Loss: 0.251583368 Epoch:130, CXE Loss: 0.249761522 Epoch:140, CXE Loss: 0.24825798 Epoch:150, CXE Loss: 0.246988103 Epoch:160, CXE Loss: 0.245859906 Epoch:170, CXE Loss: 0.24484989 Epoch:180, CXE Loss: 0.243915856 Epoch:190, CXE Loss: 0.243036643 Sequential( (0): Linear(in_features=43, out_features=20, bias=True) (1): ReLU() (2): Linear(in_features=20, out_features=2, bias=True) )
%tensorboard --logdir=runs
Reusing TensorBoard on port 6006 (pid 979), started 0:09:49 ago. (Use '!kill 979' to kill it.)
mlp_log
| | exp_name | Dataset | CXE Loss | Accuracy | ROC AUC Score |
|---|---|---|---|---|---|
| 0 | NN Model without hidden layer | Test | 0.258405 | 0.922 | 0.726055 |
| 1 | MLP with 1 hidden layer | Test | 0.242296 | 0.922 | 0.767201 |
!kaggle competitions submit -c home-credit-default-risk -f submission2.csv -m "MLP submission"
For this phase of the project, you will need to submit a write-up summarizing the work you did. The write-up form is available on Canvas (Modules -> Module 12.1 - Course Project - Home Credit Default Risk (HCDR) -> FP Phase 2 (HCDR): write-up form). It has the following sections:
The main goal of this project is to use a machine learning model trained on historical loan application data to predict whether a customer will be able to repay a loan.
Phase 2:
The main aim of this phase is data modeling and feature engineering. Extending the visual-EDA-driven feature sampling and baseline model development, we used data modeling to combine all the available datasets, and engineered features through polynomial, aggregated, numerical, and categorical experiments. We also ran experimental analyses for hyper-parameter tuning of Logistic Regression and XGBoost, conducting experiments on both the original imbalanced data and resampled data.
What you did (main experiments): The data consists of three levels of tables, so we first joined different combinations of levels and examined the correlations within those combinations. Multiple feature families were then created from categorical, numerical, aggregated, and polynomial features. These feature families were fed into the pipeline and the best-performing one was selected. Feature engineering is very important in this phase: with such a large amount of data, many feature families can be created, and choosing the most impactful one is a substantial task for which domain knowledge matters. After selecting the best feature family, the machine learning models in the pipeline (baseline model, XGBoost, and the PL model) were trained and their hyper-parameters tuned.
What were your results/findings (best pipeline and the corresponding public and private scores): Our results for this phase show that the best-performing algorithm was XGBoost, with the best AUC ROC score of 71.85%. The lowest-performing algorithm was the SVM model. The best of our four Kaggle submissions scored 0.72720 (private) and 0.73006 (public).
Problems you are tackling: The main challenge in this project is the data itself, which is a very large dataset. Second, extensive feature engineering yielded only a small increase in test accuracy.
Phase 3
We explored deep learning, which learns from past data using artificial neural networks with multiple hidden layers. Deep neural networks disentangle complex representations of the data step by step, layer by layer (hence the multiple hidden layers) into a clearer representation. An artificial neural network with one or more hidden layers between the input and output layers is called a multi-layer perceptron (MLP).
We added a single-layer neural network and a multi-layer neural network model.
We resampled the data to balance the number of points from both classes. The deep learning Kaggle score fell short of the ensemble model's. These results show that a neural network is not always the best choice for supervised binary classification: simpler methods such as Logistic Regression and gradient-boosting methods such as XGBoost outperformed the neural network model.
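The class-balancing step can be sketched with scikit-learn's `resample` utility. This is a hedged illustration on a toy frame (the real pipeline operates on the transformed training set, and the column names here are stand-ins):

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced frame: 8 majority-class rows, 2 minority-class rows
demo_df = pd.DataFrame({"feature": np.arange(10), "TARGET": [0] * 8 + [1] * 2})

majority = demo_df[demo_df["TARGET"] == 0]
minority = demo_df[demo_df["TARGET"] == 1]

# Upsample the minority class (with replacement) to match the majority count
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up])
print(balanced["TARGET"].value_counts())  # 8 of each class
```

Downsampling the majority class works the same way with `replace=False` and `n_samples=len(minority)`.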
We used XGBoost to predict loan default and reached an AUC score of over 0.72 on our Kaggle submission, after focusing on exploratory data analysis, building additional features, and developing boosted models. In Phase 3 our team implemented multiple multi-layer perceptron (MLP) models, experimenting with different architectures and activation functions. Our top-performing MLP model was built using PyTorch, and its test ROC AUC score of 0.767 is the highest among all our models.
The complete dataset consists of 7 .csv files, that is, 7 tables. Of these, the application train/test table is the primary table and the remaining 6 are secondary/supporting tables.
Primary Tables: Application train and Application test are the main tables, containing information about each loan application at Home Credit. The primary key of these tables is SK_ID_CURR, which uniquely identifies each loan entry. Training Application (application_train): The training data includes the TARGET label, which takes two values: 0 indicates the loan was repaid without any problems or delay; 1 indicates the loan was not repaid, the client had difficulty paying it back, or installments were paid with some delay. This table has 122 variables and 307,511 data entries. Testing Application (application_test): The testing application has the same features as the training application except the TARGET feature. This table has 121 variables and 48,744 data entries.
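The class balance of TARGET is worth checking before modeling. A minimal sketch (the toy frame and its 92/8 split are illustrative stand-ins, though the real table is similarly skewed toward class 0):

```python
import pandas as pd

# Hypothetical stand-in for the real application_train table
demo_train = pd.DataFrame({"TARGET": [0] * 92 + [1] * 8})

# Fraction of each class; the real data is heavily imbalanced toward 0
counts = demo_train["TARGET"].value_counts(normalize=True)
print(counts)
```

This imbalance is what motivates the resampling and the emphasis on ROC AUC over raw accuracy later in the report.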
Secondary Tables: The following are the 6 secondary tables:
Bureau(bureau.csv): The bureau table consists of the client's previous credits which are received from the other financial institutions before applying for this loan. Each previous credit has a record/row and each loan in the application data can have multiple previous credits. The application_{train/test} table is joined with the bureau table by a primary key SK_ID_CURR. The number of variables are 17 and the number of data entries are 1,716,428.
Bureau Balance (bureau_balance.csv): The bureau balance table contains monthly balance records of the client's previous credits with other financial institutions. It joins to the bureau table on SK_ID_BUREAU, which is unique in the bureau table but a foreign key here, creating a one-to-many relation. The number of variables is 3 and the number of data entries is 27,299,925.
Previous Application (previous_application.csv): This table contains clients' previous loan applications at Home Credit. It joins to the primary table on SK_ID_CURR, with one row per previous application, so one current application can map to many previous ones. The number of data entries is 1,670,214. There are four types of contracts:
- Consumer loan(POS – Credit limit given to buy consumer goods)
- Cash loan(Client is given cash)
- Revolving loan(Credit)
- XNA (Contract type without values)
POS Cash Balance (POS_CASH_balance.csv): This table contains monthly balance snapshots of previous point-of-sale (POS) and cash loans that clients had with Home Credit. It joins to the previous_application table on SK_ID_PREV, with one row per monthly balance, forming a many-to-one relationship with previous_application. The number of variables is 8 and the number of data entries is 10,001,358.
Installments Payments (installments_payments.csv): This table contains past payment data for each installment of previous credits at Home Credit related to loans in our sample, with one row for every payment made and one row for every payment missed. It joins to the previous_application table on SK_ID_PREV, forming a many-to-one relationship with previous_application. The number of variables is 8 and the number of data entries is 13,605,401.
Credit Card Balance (credit_card_balance.csv): This table contains monthly balance snapshots of clients' previous credit cards with Home Credit. It joins to the previous_application table on SK_ID_PREV, with one row per monthly balance, forming a many-to-one relationship with previous_application. The number of variables is 23 and the number of data entries is 3,840,312.
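The join-and-aggregate pattern implied by these keys can be sketched with pandas. This is a hedged illustration on toy frames; the aggregation choices and the BUREAU_CREDIT_ column prefix are assumptions, not the project's exact features:

```python
import pandas as pd

# Toy stand-ins for application_train and bureau (illustrative only)
demo_app = pd.DataFrame({"SK_ID_CURR": [1, 2], "TARGET": [0, 1]})
demo_bureau = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "AMT_CREDIT_SUM": [1000.0, 2000.0, 500.0],
})

# Aggregate the one-to-many bureau rows down to one row per applicant,
# then left-join back onto the primary table on SK_ID_CURR
bureau_agg = (demo_bureau.groupby("SK_ID_CURR")["AMT_CREDIT_SUM"]
              .agg(["mean", "sum"])
              .add_prefix("BUREAU_CREDIT_")
              .reset_index())
merged = demo_app.merge(bureau_agg, on="SK_ID_CURR", how="left")
print(merged)
```

The two-hop tables (e.g. bureau_balance) are aggregated to SK_ID_BUREAU first and then rolled up to SK_ID_CURR the same way.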
Many people struggle to get loans due to insufficient or non-existent credit histories and, unfortunately, this population is often taken advantage of by untrustworthy lenders. Home Credit (an international non-bank financial institution) strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. To make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data, including telco and transactional information, to predict its clients' repayment abilities. Home Credit focuses on lending to people regardless of their credit history, and the Kaggle dataset frames the objective of identifying and avoiding unfair loan rejections that would result from considering credit history alone. The main aim of this project is to predict applicants' loan-repayment behavior using a machine learning model. We first create a balanced dataset by handling missing values and performing correlation analysis on the given data. We then create a set of final features, including numerical and categorical feature pipelines, based on correlation scores. The data pipeline and a baseline logistic regression (LR) model are trained and evaluated; the best LR model is chosen, and the final prediction is made based on various performance metrics. The results of the machine learning pipelines are measured using the confusion matrix, precision, recall, F1 score, accuracy, and area under the ROC curve. Businesses will be able to use the model's output to identify whether a loan is at risk of default. The model ensures that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower clients to be successful.
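The evaluation metrics listed above can all be computed with scikit-learn. A minimal sketch on toy labels (not the project's actual predictions):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

# Toy ground truth, hard predictions, and class-1 probabilities
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]
y_prob = [0.1, 0.2, 0.6, 0.9, 0.8, 0.4]

print(confusion_matrix(y_true, y_pred))          # rows: true class, cols: predicted
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("ROC AUC  :", roc_auc_score(y_true, y_prob))  # needs probabilities, not labels
```

Note that ROC AUC is computed from predicted probabilities, which is why the MLP code later applies a softmax before scoring.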
This is the workflow which we have used for our project.
After focusing on exploratory data analysis, feature selection and preliminary modelling, in this last phase of the HCDR project, our work focused on three parts
Single Layer Neural network
Multi Layer Neural Network
MLP-model building and applying classification Technique
Single Layer Neural network
Here we transform the data using the data pipeline and convert it into tensors for the neural network pipeline. A single linear layer produces the prediction probabilities.
Multi Layer Neural Network
The model contains two linear layers with one hidden layer using a ReLU activation function.
MLP-model building and applying classification Technique
In Phase 3 of the project, our goal was to build a multi-layer perceptron (MLP) classification model in PyTorch and use TensorBoard to monitor real-time training results. Building on Phase 2, we applied nn.Linear() layers and the nn.ReLU() activation function in the MLP model, with 43 transformed input features and 2 output features. To visualize the real-time training results, TensorBoard was used to monitor the training loss (CrossEntropyLoss) and accuracy of each epoch. The test accuracy for the MLP with and without the hidden layer is 0.922, but the ROC AUC score increased to 0.767 for the MLP model with the hidden layer.
Data leakage is one of the leading machine learning errors. It happens when the data used to train an algorithm contains information about what the model is trying to predict, resulting in unreliable predictions after model deployment. In Phases 2 and 3 we handled the missing values in the data by replacing some of them with mean or median values. We split the data into train, validation, and test sets, fitting the preparation pipeline on the training set and applying only the fitted transform to the validation and test sets. The data is standardized using StandardScaler, and we resampled the data to have an equal number of points from both classes. With all these factors taken into consideration, there is no considerable data leakage in our modeled pipelines.
We have deployed pipelines to prevent data leaking during numeric and categorical feature preparation.
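The leakage-safe pattern — fit on the training split only, then transform the held-out splits with the fitted statistics — can be sketched as follows (variable names here are illustrative, not the project's exact pipeline):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data: 10 samples, 2 features
demo_X = np.arange(20, dtype=float).reshape(10, 2)
demo_y = np.array([0, 1] * 5)
tr_X, te_X, tr_y, te_y = train_test_split(demo_X, demo_y,
                                          test_size=0.3, random_state=0)

scaler = StandardScaler()
tr_X_tfm = scaler.fit_transform(tr_X)  # fit + transform on train only
te_X_tfm = scaler.transform(te_X)      # transform (no fit!) on held-out data
```

Calling `fit_transform` on the validation or test split would let their statistics leak into the preparation step, which is exactly what this pattern avoids.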
Phase-1 A Logistic Regression model is used as the baseline model since it is simple to develop and efficient: it does not require a lot of processing resources to train.
Phase-2 We looked into different classification models to see if we could improve our predictions. Our main focus was boosting algorithms, which are known to be extremely efficient and relatively fast. Gradient Boosting, XGBoost, LightGBM, and SVM were the preferred techniques, for the following reasons. Gradient Boosting creates a better predictive model by generating an ensemble of weak predictors. XGBoost is one of the fastest gradient-boosted tree implementations and handles missing values internally. In many circumstances, LightGBM produces results that are as effective as and faster than XGBoost while using less memory. When linear separation is required, SVM performs similarly to logistic regression, and depending on the kernel used it also performs well with non-linear boundaries, although it is prone to overfitting and training difficulties. Finally, a Voting Classifier is an ensemble model that learns from several different models and predicts the class with the highest combined probability.
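The Voting Classifier idea can be sketched on synthetic data. This is a hedged illustration: scikit-learn's GradientBoostingClassifier stands in for XGBoost here, and the settings are placeholders rather than our tuned hyper-parameters:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data for illustration
Xd, yd = make_classification(n_samples=200, n_features=10, random_state=42)

# Soft voting averages the members' predicted class probabilities
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("gb", GradientBoostingClassifier(random_state=42)),  # stand-in for XGBoost
    ],
    voting="soft",
)
ensemble.fit(Xd, yd)
preds = ensemble.predict(Xd[:5])
print(preds)
```

With `voting="soft"` the ensemble exposes `predict_proba`, so ROC AUC can be computed the same way as for the individual models.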
Phase-3 As summarized above, we explored deep learning, adding a single-layer neural network and a multi-layer neural network (MLP) model and resampling the data to balance the classes. The deep learning Kaggle score fell short of the ensemble model's; Logistic Regression and XGBoost outperformed the neural network model, showing that neural networks are not always the best choice for supervised binary classification.
All result details and screenshots are below. Overall, in this project we used various feature selection techniques on a model of 183 highly correlated features. XGBoost and Logistic Regression have almost the same public and private scores.
Neural Network: Our simple neural network achieved an ROC score of 76.21% and the multi-layer network 72.60%, from which we conclude that the simple network performed better than the multi-layer one. Training the deep-learning models on the full dataset also took much less time than other classifiers such as Logistic Regression and XGBoost.
We used the following classifiers:
Logistic Regression: chosen as the baseline model, trained on both the balanced and imbalanced datasets with feature engineering. Training accuracy was 68.9% and test accuracy 68%; the best parameters yielded a 75% ROC score. Running the same model with PCA reduced the test ROC to 69%.
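The baseline-vs-PCA comparison can be sketched with a pair of pipelines; the 95% explained-variance cutoff and synthetic data are assumptions for illustration.

```python
# Logistic-regression baseline, with and without a PCA step to handle
# multicollinearity among correlated features.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=30, random_state=0)

base = Pipeline([("scale", StandardScaler()),
                 ("lr", LogisticRegression(max_iter=1000))])
with_pca = Pipeline([("scale", StandardScaler()),
                     ("pca", PCA(n_components=0.95)),  # keep 95% of variance
                     ("lr", LogisticRegression(max_iter=1000))])

for name, model in [("baseline", base), ("with PCA", with_pca)]:
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(name, f"mean CV AUC: {auc:.3f}")
```

Because PCA discards some variance, a small drop in ROC like the one observed above is a plausible trade-off for removing multicollinearity.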
Gradient Boosting: boosting did help achieve better results, good enough to justify implementing and evaluating other boosting models. This model reached a training accuracy of 99.6% and a test accuracy of 91.3%; the gap between the two suggests some overfitting. Its test ROC AUC came out to 66%.
XGBoost: by far the best model, both in training time and in accuracy on the selected features and the balanced dataset. Training accuracy was 95.5% and test accuracy 83.4%; test ROC AUC was 63.6%.
SVM: the lowest-performing model in our experiments. Even after hyperparameter tuning of the RBF and polynomial kernels, the results were not promising; its ROC score was 51.5%.
Overall results of classifiers:
Results of Neural Network Model
Project focus: In the HCDR project we use Home Credit's data to predict which customers with little or no credit history can repay a loan, which in turn improves livelihoods by extending credit to people with thin credit files. The main objective of the machine-learning model is to identify potential defaulters from the data provided about each applicant. Calibrated classification probabilities are essential: we want to be very confident before classifying someone as a non-defaulter, because the cost of that mistake is very high for the company. In this phase we confirmed our hypothesis that tuned machine-learning techniques can outperform baseline models and aid Home Credit in evaluating loan applications.
Hypothesis: Feature engineering turned out to be the most crucial part of building an accurate classifier. Before Phase 2, i.e. before feature engineering, the accuracy score was around 63%; after feature aggregation and the incorporation of the secondary and tertiary datasets, it rose to 73%. This supports our hypothesis that ML pipelines with custom features can accurately predict HCDR outcomes.
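The aggregation step mentioned above can be sketched with pandas: roll a secondary table up to one row per applicant and join it onto the main table. The tiny frames and aggregate column names here are hypothetical placeholders for the real HCDR files.

```python
# Aggregate a child table (one row per credit record) per applicant,
# then merge the summary statistics back onto the main application table.
import pandas as pd

main = pd.DataFrame({"SK_ID_CURR": [1, 2, 3],
                     "AMT_INCOME": [100, 200, 150]})
bureau = pd.DataFrame({"SK_ID_CURR": [1, 1, 2, 3, 3, 3],
                       "AMT_CREDIT_SUM": [10, 20, 5, 7, 8, 9]})

agg = (bureau.groupby("SK_ID_CURR")["AMT_CREDIT_SUM"]
             .agg(["mean", "max", "count"])     # summary stats per applicant
             .add_prefix("BUREAU_CREDIT_")
             .reset_index())
features = main.merge(agg, on="SK_ID_CURR", how="left")
print(features)
```

The same pattern extends to the tertiary tables: aggregate each child table to the applicant level first, then left-join everything onto the main frame.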
Summary of main points. Phase 1: we analyzed the features of the dataset, compared each feature's relation with the target, identified the top 14 correlated features, and examined the missing values and the distribution of each feature. We chose a subset of highly correlated features based on this correlation analysis. The resulting Kaggle submission scored 73%, which we believe is a very good start.
Phase 2: We built a simple baseline model, then experimented with feature aggregation, feature engineering, and various preprocessing pipelines, which increased efficiency for some models and reduced it for others. The models used for prediction were Logistic Regression with PCA to handle multicollinearity, and ensemble approaches using Gradient Boosting, XGBoost, and SVM. Our best-performing algorithm was XGBoost, with a best validation AUC ROC of 71.85%; the lowest-performing model was SVM. Among the related ensemble models, Gradient Boosting showed a slightly lower validation AUC ROC of 71.52%. Our best Kaggle submission out of the four scored 0.72720 (private) and 0.73006 (public).
Phase 3: We implemented neural networks with both single- and multi-layer perceptrons. We used XGBoost to predict loan default and reached an AUC above 0.72 on our Kaggle submission. Building on our exploratory data analysis, additional features, and boosted models, our team implemented multiple multi-layer perceptron (MLP) models, experimenting with different architectures and activation functions. Our top-performing MLP, built with scikit-learn's MLPClassifier, yielded a validation AUC of 0.526, lower than our best-performing model, the soft voting classifier.
Significance of results:
Our best-performing models were Logistic Regression and XGBoost, with nearly identical train and test Kaggle scores. The overall highest private score was 72.28% (XGBoost) and the highest public score 72.61% (Logistic Regression).
Future work: Although the case study is complete, there are a couple of ideas we had in mind but could not try due to time and resource constraints. One thing we attempted but could not finish was Sequential Forward Feature Selection for choosing the best set of features: given the number of features, it has very high time complexity, and without strong computational resources we could not complete it. We also believe we have not used stacking to its full potential in this case study. Stacking a diverse set of base classifiers trained on different feature subsets, perhaps around 15-20 of them, could yield a noticeably better score.
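One way that stacking idea could be prototyped is with scikit-learn's StackingClassifier; the two base learners and the logistic-regression meta-learner below are illustrative stand-ins for the 15-20 diverse classifiers proposed above.

```python
# Stacking: base models' out-of-fold predictions become the inputs to a
# final meta-learner, which learns how to combine them.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=20, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("svc", SVC(probability=True, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),  # meta-learner
    cv=3,  # out-of-fold predictions avoid leaking training labels
)
auc = cross_val_score(stack, X, y, cv=3, scoring="roc_auc").mean()
print(f"stacked CV AUC: {auc:.3f}")
```

Training base learners on different feature subsets, as suggested above, would require wrapping each estimator in a pipeline with its own column selector.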